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ABSTRACT 


Title  of  Dissertation:  CLASSIFICATION  AND  COMPRESSION  OF  MULTI¬ 
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VECTOR  QUANTIZER  APPROACH 


Sudhir  Varma,  Doctor  of  Philosophy,  2002 


Dissertation  directed  by:  Professor  John  S.  Baras 
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Tree  structured  classifiers  and  quantizers  have  been  used  with  good  success 
for  problems  ranging  from  successive  refinement  coding  of  speech  and  images  to 
classification  of  texture,  faces  and  radar  returns.  Although  these  methods  have 
worked  well  in  practice  there  are  few  results  on  the  theoretical  side. 

We  present  several  existing  algorithms  for  tree  structured  clustering  using 
multi-resolution  data  and  develop  some  results  on  their  convergence  and  asymp¬ 
totic  performance. 

We  show  that  greedy  growing  algorithms  will  result  in  asymptotic  distortion 
going  to  zero  for  the  case  of  quantizers  and  prove  termination  in  finite  time  for 
constraints  on  the  rate.  We  derive  an  online  algorithm  for  the  minimization  of 


distortion.  We  also  show  that  a  multiscale  LVQ  algorithm  for  the  design  of  a 
tree  structured  classifier  converges  to  an  equilibrium  point  of  a  related  ordinary 
differential  equation. 

Simulation  results  and  description  of  several  applications  are  used  to  illustrate 
the  advantages  of  this  approach. 
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Chapter  1 


Introduction 


One  important  problem  in  communication  theory  is  that  of  signal  coding.  Most 
real-world  signals  of  interest  are  real-valued  scalars  or  vectors  and  most  commu¬ 
nication  systems  are  digital  in  nature  and  for  such  systems  mapping  a  real- valued 
signal  onto  a  discrete  alphabet  is  necessary  before  it  can  be  transmitted. 

Signal  coding  lies  at  the  heart  of  almost  all  modern  data  transmission  systems. 
Digital  transmission  offers  a  tremendous  advantage  over  analog  transmission  in 
noise  suppression,  signal  processing  algorithms,  error  correction  schemes  and  se¬ 
curity  issues.  The  easy  availability  and  flexibility  of  embedded  or  stand-alone 
computing  power  has  made  digital  communication  and  processing  of  signals  ubiq¬ 
uitous. 

Since,  in  general,  a  real-valued  source  cannot  be  exactly  represented  by  a  dis¬ 
crete  alphabet,  we  have  the  notion  of  a  distortion  measure  which  quantifies  the 
cost  incurred  by  representing  a  given  source  with  a  given  alphabet.  Convention¬ 
ally,  lesser  cost  implies  better  performance  and  to  find  the  optimal  alphabet  with 
the  least  distortion  for  a  given  source  is  always  our  aim.  In  succeeding  chapters, 
we  will  give  our  definition  of  a  distortion  measure  and  present  algorithms  that 
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construct  optimal  alphabets. 

Shannon  [42]  introduced  the  idea  that  quantizing  a  real  valued  signal  onto  a 
discrete  alphabet  with  the  least  amount  of  distortion  is  possible  if  this  quantization 
is  done  on  signal  blocks,  instead  of  scalar  values.  Significant  improvement  in 
distortion  over  scalar  quantization  is  possible  even  if  the  values  of  the  signal  in  a 
block  are  independent  of  each  other.  If  the  values  of  the  signal  at  subsequent  time 
points  are  not  independent,  we  gain  even  more  in  performance.  Then  there  are  also 
times  where  the  signal  itself  is  a  vector  and  quantizing  each  component  separately 
would  be  inefficient.  Such  cases  happen,  for  example,  in  image  processing  or  sensor 
arrays. 


1.1  The  quantization  problem 

As  mentioned  earlier,  the  initial  applications  of  vector  quantization  (VQ)  were 
in  the  field  of  quantization  and  coding  of  real,  vector-valued  signals.  Here  the 
objective  is  to  find  a  mapping  from  the  continuous  valued  vector  space  V  C  Wl 
to  a  discrete  alphabet  with  k  elements.  Specifically,  we  look  for  a  function  Q  : 
V  i— ►  0  =  {di,  6*2, ... ,  9k },  with  9i  e  This  gives  us  a  partition  V  =  {V;}  such 
that  Vi  =  {x  G  T>\ Q(x)  =  6i}.  For  any  x  G  V;,  the  vector  $i,  is  the  codeword 
(or  representation)  assigned  by  Q(x).  Our  goal  is  to  select  a  code-book  O  and  a 
function  Q(x)  such  that  the  average  quantization  error,  D(V,  O)  =  E(p(Q(x),x)) 
measured  by  a  positive,  convex  function  p(Q(x),x)  is  minimized.  Some  examples 
of  p  are  the  mean  squared  error  |  \x  —  0{\ |2,  the  general  rth- norm  error  (| \x  —  0,-L\ |r)r 
and  the  Itakura-Saito  distortion  (x  —  6i)TR(x)(x  —  9i )  (see  appendices  of  [9]). 

One  requirement  that  is  not  readily  apparent  from  the  above  description  is  that 
the  mapping  Q(.)  must  be  mathematically  tractable  in  the  sense  that  given  any 


2 


x ,  it  should  not  be  computationally  burdensome  to  calculate  Q(x)  nor  should  the 
memory  requirements  for  storing  this  map  be  excessive.  We  will  see  later  that  the 
optimal  mapping  is  easy  to  calculate  and  the  storage  requirements  are  limited  to 
storing  the  values  of  a  few  centroids. 

1.2  The  classification  problem 

Mapping  a  vector-valued  signal  to  a  discrete  alphabet  has  applications  other  than 
coding.  One  common  problem  in  signal-processing  is  classification  of  data  in  the 
form  of  a  vector.  Specifically,  assume  that  we  have  a  random  ordered  pair  (A",  Cx) 
where  X  is  i.i.d.  and  can  take  values  in  D  C  and  Cx  £  {1)2}  is  a  class  label 
that  can  be  one  of  two  classes.  We  have  the  prior  probabilities  7Ti  =  P(cx  =  1) 
and  7t2  =  P(cx  =  2)  and  probability  density  functions  p\{x)  =  p(x\cx  =  1)  and 
P2(x)  =  p(x\cx  =  2).  We  are  interested  in  finding  a  predictor  C(x)  for  the  class  cx, 
given  the  vector  x.  C(x)  must,  of  course,  give  a  unique  prediction  for  any  x.  In 
addition  we  would  like  to  ensure  that  the  classification  error  P{C(X)  ^  Cx)  is  as 
small  as  possible. 

Some  real  life  examples  of  this  problem  include  classification  of  radar  returns 
and  face  recognition  [3],  texture  classification  [32],  vowel  classification  [41]  and  tool 
wear  analysis  [45]. 

C(x)  is  a  mapping  from  T>  onto  {1,2}  which  partitions  the  space  T>  into  two 
areas,  V\  =  {x  :  C{x )  =  1}  and  V2  =  {x  :  C(x)  =  2}.  The  correspondence  to 
the  quantization  problem  described  in  the  previous  section  is  clear.  Instead  of 
mapping  each  x  to  a  representative  to  minimize  the  expected  distortion  p(., .)  we 
map  it  to  a  class  to  minimize  a  classification  error. 

Finding  a  C(x)  that  minimizes  the  classification  risk  is  easy  if  we  know  7Ti,  7T2,pi(x) 
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and  P2(x).  But,  in  most  cases  we  only  have  a  training  sample  in  the  form  of  a  se¬ 
quence  of  observation  vectors  and  corresponding  class  labels  (aq,  cQ,  i  =  1,  2, ,  n. 
In  Chapter  3  we  will  present  the  popular  LVQ  algorithm  for  constructing  a  decision 
function  C(x )  using  training  samples. 

1.3  Tree  structured  quantizers  and  classifiers 

Quantization  and  classification  can  both  be  seen  as  decision-making  algorithms 
in  the  sense  that  given  a  vector  x  we  have  to  choose  one  of  several  codewords  or 
classes  to  assign  to  the  vector. 

Tree  Structured  Vector  Quantizers  (TSVQ)  offer  faster  lookup  speed  compared 
to  other  quantizers  and  classifiers.  With  k  codewords  or  classes,  the  TSVQ  execu¬ 
tion  time  is  0(log(/c))  compared  to  0{k )  for  an  ordinary  quantizer.  This  improve¬ 
ment  comes  with  a  decrease  in  performance,  but  with  a  carefully  designed  TSVQ, 
this  decrease  can  be  minimized. 

TSVQ  operate  on  the  principle  of  successive  refinement  of  knowledge.  It  does 
a  progressive  classification  where  an  initial  rough  classification  is  done  followed  by 
hirer  and  finer  partitions  that  improve  the  performance. 

Multi-Resolution  TSVQ  (MRTSVQ)  is  an  improvement  on  TSVQ  in  cases 
where  we  can  obtain  the  observation  vector  in  multiple  levels  of  detail.  Coarse 
representations  can  be  used  to  do  the  initial,  rough  classification  and  more  and 
more  detail  can  be  added  for  finer  discrimination. 


4 


1.4  Summary  of  new  work 


In  this  thesis  we  will  present  a  greedy  growing  algorithm  for  constructing  tree 
structured  classifiers  and  quantizers  for  multi-resolution  data.  For  the  quantization 
case  we  show  that  under  some  conditions  the  distortion  of  the  tree  goes  to  zero 
asymptotically  as  the  number  of  iterations  increases  without  bound. 

For  any  tree  structured  clustering  method,  we  assign  a  new  vector  to  its  appro¬ 
priate  cluster  by  starting  out  with  the  root  as  the  current  node.  Then  the  children 
of  the  current  node  are  compared  with  the  given  vector  to  find  the  closest  node. 
This  node  now  becomes  the  new  current  node  and  the  process  repeats  until  we 
reach  a  leaf  node.  The  cluster  corresponding  to  this  leaf  node  is  assigned  as  the 
cluster  to  which  the  vector  belongs.  The  expected  number  of  comparisions  with 
child  nodes  is  a  measure  of  the  expected  computational  time  it  will  take  to  assign 
a  new  vector  to  its  cluster.  We  will  call  this  the  rate  of  the  tree.  It  is  easy  to  see 
(and  we  will  prove  it  later)  that  as  the  tree  grows,  the  rate  of  the  tree  increases. 

Now  assume  that  we  stop  the  algorithm  when  the  rate  of  the  tree  gets  larger 
R.  Then  we  will  show  that  the  algorithm  stops  in  finite  time. 

We  also  derive  the  form  of  the  online  algorithm  for  minimization  of  distortion 
that  was  implied  in  [2]  for  the  case  of  two  centroids  and  show  how  it  can  be 
extended  for  the  general  iF-centroid  case. 

For  the  classification  case  we  present  a  multi-scale,  multi-level  LVQ  algorithm 
that  adapts  each  level  of  the  tree  simultaneously  to  converge  to  a  classifier  that 
seeks  to  reduce  the  Bayes  risk.  We  show  that  the  adaptation  algorithm  behaves 
like  a  corresponding  ordinary  differential  equation  and  converges  to  the  global 
equilibrium  of  the  ODE. 

Finally  we  present  several  simulation  results  that  illustrate  the  implementation 
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of  these  algorithms  and  one  example  of  the  application  of  these  clustering  algo¬ 
rithms  to  the  prediction  of  tool  wear  from  acoustic  emissions.  We  also  give  several 
heuristics  for  practical  implementation  of  these  algorithms. 


1.5  Arrangement  of  thesis 

Chapter  2  presents  background  on  the  motivation  and  methodology  for  use  of 
VQ  and  TSVQ  for  compression  and  classification.  We  present  several  well  known 
results  on  algorithms  and  achievability  of  successive  refinement  codes  for  vector 
data. 

In  Chapter  3,  we  present  our  definition  of  a  Multi- Resolution  Analysis  while 
Chapter  4  will  deal  with  the  MRTSVQ  algorithm  as  it  applies  to  compression  and 
some  results  on  asymptotic  distortion,  termination  and  online  algorithms. 

In  Chapter  5  we  describe  the  multi-scale  LVQ  algorithm  for  classification  us¬ 
ing  an  MRTSVQ  and  present  some  results  pertaining  to  the  convergence  of  this 
algorithm. 

We  end  with  simulations  and  descriptions  of  several  applications  of  MRTSVQ 
in  Chapter  6. 

1.6  Notation 

We  will  use  script  letters  £?, . . .)  to  denote  transforms  of  signals.  Calligraphic 
letters  (V,  Si} . . .)  will  be  used  for  sets  and  the  set  of  real  numbers  will  be  denoted 
by  M.  Symbols  starting  with  d  will  usually  denote  the  dimension  of  vector  spaces. 
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Chapter  2 


Compression,  Classification  and 
Trees 

2.1  VQ  for  compression 

As  mentioned  earlier,  vector  quantization  was  initially  proposed  for  the  purpose  of 
quantizing  a  real- valued  signal  onto  a  finite,  discrete  alphabet.  Given  a  real- valued 
source,  there  are  an  infinite  number  of  alphabets  that  it  can  be  mapped  onto;  each 
such  mapping  producing  a  code  of  some  information  rate  R  and  suffering  some 
distortion  D.  Rate-Distortion  theory  [11]  shows  that  the  set  of  all  possible  couples 
(. R ,  D)  has  a  convex  hull  called  the  achievable  rate- distortion  limit.  Points  on 
the  hull  correspond  to  alphabets  that  have  either  the  minimum  rate  for  a  given 
distortion  or  the  minimum  distortion  for  given  rate. 

This  mapping  can  be  done  on  a  point-to-point  basis  where  the  output  of  the 
source  at  each  time  point  is  mapped  onto  a  letter;  or  it  can  be  done  in  a  block-wise 
fashion  where  a  block  of  data,  k  time  steps  long,  is  mapped  onto  a  letter.  Except 
for  rare  cases,  the  rate-distortion  limit  is  achieved  asymptotically  as  k  — »  oo.  This 
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holds  true  even  if  the  output  of  the  source  is  independent  from  one  time-step  to 
the  next.  As  Cover  and  Thomas  put  it,  “It  is  simpler  to  describe  an  elephant  and 
a  chicken  with  one  description  than  to  describe  each  alone.”  [11,  page  336] 

This  also  means  that  if  the  source  is  itself  multi-variate  and  outputs  a  vector 
valued  signal  at  each  time,  quantizing  the  output  using  a  vector  quantizer  is  more 
efficient  than  quantizing  each  component  separately. 

2.1.1  Problem  definition 

Consider  a  signal  consisting  of  a  sequence  {xi},Xi  E  Md,  with  Xi  i.i.d.  and  dis¬ 
tributed  according  to  a  probability  density  p(x).  We  want  a  mapping  Q(x)  :  Rd  i— > 
0  =  {6*i,  02, . . .  Ok}-  Here  0  is  a  discrete  alphabet  with  k  letters  {Oi},  6i  E  Md. 

In  addition,  we  have  a  distortion  measure  p{6,  x)  which  quantifies  the  cost 
incurred  in  representing  x  by  6.  The  expected  value  of  this  distortion  measure  is 
denoted  as 

D(Q,P)  =  Ep(p(X,Q(X ))) 

Given  a  probability  density  p(.),  our  goal  is  to  find  a  quantizer  (V,  0)  that 
minimizes  the  distortion  D(V,  0).  In  the  next  subsection  we  will  present  the 
Linde,  Buzo  and  Gray  (LBG)  algorithm  that  iteratively  finds  a  locally  optimal 
quantizer. 

2.1.2  Iterative  local  minimization  of  distortion 

Linde,  Buzo  and  Gray  [29]  show  that  the  following  conditions  are  necessary  for  a 
quantizer  to  minimize  the  cost  D(V ,  0): 


1.  For  a  given  reproduction  alphabet  0,  the  partition  V  =  {V;},  where 

Vi  =  {x\  p(9,  Xi)  <  p(6,Xj),j  =  1,2,...,  A:,  j  ±  i}  (2.1) 

has  an  expected  error  that  is  not  greater  than  any  other  partition 

2.  For  a  given  partition  V,  the  reproduction  alphabet  0  =  {dj},  where 

9i  —  argrnin  J  p(9,x)p(x)dx  (2.2) 

gives  an  expected  error  that  is  not  greater  than  any  other  alphabet.  We  call 
9i  the  centroid  of  Vt. 

These  two  conditions,  which  are  a  generalization  of  the  Lloyd-Max  algorithm 
[30],  [31]  for  the  design  of  scalar  quantizers,  can  be  used  in  an  iterative  fashion  in 
the  LBG  algorithm  to  obtain  a  partition  and  an  alphabet  that  is  locally  minimal 
for  D(V,  0)  [29], 

Condition  (1)  is  called  a  Nearest  Neighbor  or  Voronoi  or  Dirichlet  partition.  It 
should  be  noted  that  we  need  to  store  only  the  values  of  the  k  centroids  9t  to  fully 
characterize  this  quantizer. 

If  p(x,  y)  =  j \x— y\ j2  is  the  squared  error  distance  and  we  have  only  two  centroids 
9 1  and  92,  then  the  Nearest  Neighbor  condition  gives  a  partition  {Vi,  V2}  =  {V  fl 
H,V  fl Hc}  where  H  —  { x  :  j  \x  —  9\ |  |2  <  | \x  —  92\ |2}.  Thus  the  partition  is  achieved 
by  a  hyper-plane  | \x  —  9\ \  \  =  |  \x  —  92 \  \ . 

2.1.3  Asymptotic  behavior  for  high  resolution  quantizers 

Assume  that  p(9,x)  =  ||x  —  9\\2.  Then  it  is  easy  to  show  that  for  a  distribution  p 
of  the  random  variable  X  such  that  A{||A^||2}  <  oo,  the  distortion  for  the  optimal 
quantizer  can  be  made  arbitrarily  small  by  choosing  a  large  enough  number  of 
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centroids  k.  Indeed,  for  any  5  >  0,  choose  a  ball  of  radius  r  <  oo  centered  on  zero 
such  that  8{||a;||J  :  ||a;||2  >  r2}  <  S2/ 2.  The  finite  variance  of  A"  implies  that  such 
an  r  exists.  Now  divide  this  ball  using  a  partition  composed  of  planes  perpendicular 
to  each  axis,  separated  by  a  distance  of  52  / \fd  where  d  is  the  dimension  of  X.  Then 
8(0,  r)  is  partitioned  by  a  finite  set  of  cuboids  of  side  52/\fd,.  Assign  the  center 
point  of  each  cuboid  as  a  centroid  to  all  points  lying  in  the  cuboid.  Since  no 
point  in  the  ball  is  farther  than  52  /  2  from  its  centroid,  the  distortion  due  to  points 
lying  in  the  ball  is  less  than  52  / 2.  For  all  points  lying  outside  the  ball,  assign 
the  origin  as  the  centroid.  Then  the  distortion  contributed  by  these  points  is 
8{||a;||2  :  ||a;||J  >  r2}  <  52/ 2.  Thus  the  total  distortion  is  less  than  S2.  This  gives 
the  result. 

This  result  also  shows  that  the  minimum  distortion  with  k  number  of  centroids, 
8(V,  0)  — >  0  as  k  — >  oo. 

2.2  Vector  classification 

As  mentioned  in  the  previous  chapter,  classification  of  vectors  is  a  problem  that 
is  closely  related  to  quantization.  In  quantization,  we  are  interested  in  finding  a 
mapping  between  a  vector  x  e  Wl  onto  a  discrete  alphabet  0  that  minimizes  a 
distortion  measure  between  x  and  its  quantized  value.  In  classification,  we  assume 
that  X  is  an  observation  that  depends  on  some  class  C\  and  we  need  to  find  a 
mapping  C(X)  from  X  onto  a  discrete  alphabet  that  minimizes  a  classification 
error  between  the  actual  class  Cx  and  the  prediction  C. 
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2.2.1  Bayes  risk  and  Bayes  optimal  classifier 

The  most  common  case  is  when  both  C'x  and  C( X)  take  values  in  the  same  alpha¬ 
bet.  Then,  the  Bayes  risk  is  one  quantitative  measure  of  classification  performance 
[24,  Ch.  7.2],  Let  us  denote  7Ti  =  P{CX  =  1}  and  7r2  =  P{CX  =  2}  the  a-priori 
probabilities  of  observing  a  vector  of  a  particular  class  and  pi(x)  and  P2(x)  the 
a-posteriori  probability  densities  for  vector  x  given  that  class  1  or  2  was  observed. 
For  the  case  where  we  have  two  classes  the  signal  space  is  split  into  two  disjoint 
regions  Vi  =  {x\C(x)  =  1}  and  V2  =  {x\C(x)  =  2}.  The  Bayes  risk  is  then,  the 
probability  of  misclassification 

-B(Vi,V2)=  I  7i2P2(x)dx  +  I  7ripi(x)dx  (2.3) 

Jv  1  Jv2 

The  Bayes  optimal  classifier  is  defined  as  a  partition  that  minimizes  the  cost  (2.3). 
It  is  easy  to  see  that 

Vi  =  {x  :  7TiPi(x)  >  7r2p2(^)} 

V2  =  {x\  IT 2p2(x)  >  TTiPiix)}  (2.4) 

is  such  a  partition. 

2.2.2  Learning  Vector  Quantization  (LVQ) 

We  have  already  mentioned  that  the  problem  with  trying  to  implement  the  Bayes 
optimal  classifier  is  that  we  need  to  know  the  a  priori  probabilities  tti  and  7 r2 
and  the  a  posteriori  probability  densities  pi(x)  and  p2(^)-  In  almost  all  cases  of 
interest,  all  we  have  available  is  a  set  of  training  vectors  along  with  their  classes 
{( xn,cn )  :n  =  1,2, . .  ,,N}. 

One  solution  to  this  problem  is  Learning  Vector  Quantization  (LVQ),  first  pro¬ 
posed  by  Kohonen  [24],  LVQ  is  a  clustering  method  for  approximating  the  Bayes 
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regions  with  the  Nearest  Neighbor  partitions  of  some  set  of  centroids  0  =  {(9*}  and 
a  distance  function  p(9,x).  The  algorithm  starts  with  an  initial  set  of  centroids 
along  with  corresponding  classes  for  their  partitions  and  iteratively  updates  the 
centroids  and  their  classes  according  to  the  training  sequence  ( xn,cn ). 

The  update  algorithm  is  as  follows 

9i(n  +  1)  =  9i(n )  —  a(n )  Ve  p(&ii  xn),  if  cn  is  equal  to  class  of  9 j 

9i(n  +  1)  =  9i(n)  +  a(n )  Ve  p(#i,  xn ),  if  cn  is  not  equal  to  class  of  9i  (2.5) 

for  all  9i  in  a  neighborhood  of  xn.  Usually,  the  only  9i  updated  in  each  iteration  is 
the  one  nearest  to  xn.  a(n )  is  a  learning  parameter  that  determines  how  the  past 
observations  are  weighted  with  respect  to  the  present.  If  p(9,x)  —  ||x  —  9 1|2,  then 
\/ep(9,x)  =  2(9  —  x),  and  it  is  easy  to  see  that  the  9i  is  pushed  towards  or  away 
from  xn,  depending  on  whether  the  classes  are  equal  or  different.  After  a  cycle  of 
updates  in  this  way,  the  assigned  class  of  all  centroids  are  updated  according  to 
the  majority  class  in  its  partition. 

ft  seems  reasonable  that  if  an  equilibrium  exists  for  the  centroids  in  the  above 
algorithm,  the  total  “push”  on  the  centroids  by  vectors  belonging  to  a  different 
class  must  be  exactly  balanced  by  the  total  “pull”  by  the  vectors  of  the  same  class. 
Making  this  intuition  more  precise,  [26]  shows  that  under  some  conditions  on  the 
learning  parameter  a(n ),  the  LVQ  algorithm  converges  to  centroids  9i  such  that 

/  (nipi(x)  -  Tr2p2(x))  Ve  p(9i,x)dx  =  0,\/i  (2.6) 

JVi 

with  Vi  the  Nearest  Neighbor  partition  for  d*. 

In  general,  [26]  shows  that  the  behavior  of  the  LVQ  algorithm  approximates 
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that  of  the  ordinary  differential  equation  (ODE) 


6i  =  /  (7Tipi(x)  —  vr2j92(a;))  xjg  p(Oi)x)dx  for  all  0*  of  class  1 
Jvi 

0 i  =  /  (vr2j92(a;)  —  7TiPi(x))  \/g  p(Oi)x)dx  for  all  of  class  2 
JVi 

2.3  Tree  Structured  Vector  Quantizers 

One  way  of  getting  reduced  algorithmic  complexity  with  the  same  amount  of  com¬ 
putational  resources,  with  little  degradation  in  performance  is  a  Tree  Structured 
VQ  (TSVQ).  TSVQs  operate  on  the  principle  of  successive  refinement  of  knowledge 
[16].  In  the  compression/quantization  case,  this  takes  the  form  of  a  progressive 
code  for  x  [8].  A  coarse  partitioning  in  the  upper  layers  of  the  tree  results  in  the 
most  significant  bits  of  the  code.  Further  refinement  in  the  lower  levels  of  the  tree 
gives  the  less  significant  bits. 

In  the  case  of  classification,  successive  refinement  means  that  in  the  beginning 
we  do  a  broad  classification  at  a  higher  level;  followed  by  refining  of  the  classifica¬ 
tion  at  the  lower  levels  until  we  get  a  sufficiently  low  level  of  error.  The  classifier 
is  in  the  form  of  a  tree  where  at  each  of  the  nodes  a  test  is  done  that  determines 
which  child  node  it  is  classified  into  [8].  In  this  way  a  test  vector  ends  up  at  a  leaf, 
each  of  which  is  associated  with  a  class. 

2.3.1  TSVQ  for  compression 

Using  a  tree  structured  quantizer  offers  several  advantages.  In  the  case  where  the 
number  of  letters  in  the  codebook  is  very  large  (see  [3]  for  example),  searching 
through  all  the  centroids  of  an  ordinary,  single  level  quantizer  to  find  the  nearest 
centroid  might  be  computationally  expensive.  For  k  centroids,  this  search  takes 
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0(k)  time  which  might  not  be  feasible  in  real-time  applications.  But  this  process 
is  highly  parallclizablc  and  a  parallel  computation  of  all  the  k  distances,  followed 
by  a  parallel  search  can  bring  down  the  time  complexity  to  0(log(k)).  But  this 
requires  more  computing  resources.  A  TSVQ  will  achieve  the  same  decrease  in 
computational  effort  without  additional  processors.  In  many  cases,  the  degradation 
in  fidelity  for  the  same  rate  can  be  minimized. 

Another  advantage  of  a  TSVQ  is  that  it  naturally  results  in  a  successive  re¬ 
finement  code.  The  structure  of  the  code  is  such  that  a  few  bits  give  a  coarse 
description  of  the  source  and  the  rest  of  the  bits  provide  more  and  more  details. 
This  is  useful  in  when  multi-media  data  has  to  be  sent  over  communication  links  to 
users  with  differing  capacity.  Using  the  same  codebook,  users  with  high  capacity 
can  enjoy  high-fidelity  audio  and  video  while  those  with  low  capacity  links  can 
trade  off  rate  for  data  that  lacks  detail,  but  is  still  intelligible. 

The  second  way  in  which  multi-rate  codes  can  be  useful  is  when  a  user  might 
want  to  skim  through  several  images  quickly  and  is  not  interested  in  high-resolution, 
but  might  want  to  pick  out  some  of  the  images  to  look  at  in  higher  detail. 

Which  one  of  the  codes,  high  or  low  rate  should  we  optimize  first  in  the  first 
case?  The  answer  depends  on  factors  like  the  relative  capacities  of  the  two  classes 
of  users  and  how  much  each  is  paying  for  the  multi-media  service.  The  answer  is 
easier  in  the  second  case;  we  would  want  to  find  the  optimal  low-rate  codebook 
and  fix  it,  and  then  try  to  find  the  high-rate  codebook  that  is  optimal  with  the 
given  constraint. 

Under  what  conditions  is  it  possible  to  optimize  the  codebooks  for  each  rate 
simultaneously?  This  question  is  addressed  in  the  next  section. 
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2.3.2  Achieving  successive  refinement 

Successive  refinement  is  defined  as  a  higher  rate  (and  lower  distortion)  code  that 
attains  the  rate-distortion  (R-D)  limit  which  is  formed  by  appending  bits  to  a 
lower  rate  (and  higher  distortion)  code  that  itself  attains  the  R-D  limit.  Suppose 
that  we  are  given  a  real-valued  random  variable  X  and  that  we  use  two  encoders 
to  create  a  low-rate  encoded  signal  Xx  of  rate  R\  and  expected  distortion  Di,  and 
a  higher-rate  X2  of  rate  R2  and  distortion  D2.  Then  we  will  say  that  X\  and  X2 
are  successive  refinements  of  X  if 

1.  X2  can  be  obtained  from  X\  by  appending  to  it  (R2  —  R\)  bits  on  the  average. 

2.  (i?i,  D\)  and  ( R2,D2 )  attain  the  R-D  limit. 

Note  that  the  bits  that  are  appended  to  X\  to  obtain  X2  are  derived  from  X  and 
represent  additional  information  that  is  not  present  in  X\. 

Equitz  and  Cover  discuss  the  achievability  of  a  successive  refinement  code  in 
[16].  They  show  that  for  a  code  to  be  a  successive  refinement  code,  it  is  necessary 
and  sufficient  that  the  sequence  X\  — >  X2  — >  X  is  Markov. 

If  this  condition  is  not  satisfied,  the  code  will  not  be  a  successive  refinement 
code.  In  such  a  case,  Rimoldi  [39]  characterizes  the  set  of  all  (Ri,Di),  ( R2,D2 ) 
pairs  that  are  achievable.  He  also  extends  this  to  general  L-resolution  codes. 

Tree  structured  vector  quantizers  are  a  way  of  creating  multi-resolution  code¬ 
books  that  are  very  efficient,  though  suboptimal.  There  are  various  ways  of  cre¬ 
ating  a  tree  and  they  are  detailed  in  [8].  Here,  we  will  briefly  summarize  different 
approaches  and  then  present  a  greedy  tree  growing  algorithm 


15 


2.3.3  Construction  of  TSVQ 

Constructing  a  TSVQ  for  compression  can  be  done  by  either  the  top  down  or  the 
bottom  up  approach  [13].  In  the  top  down  approach,  we  start  with  a  root  node  that 
contains  the  whole  vector  space.  Then  this  root  node  is  split  (and  the  vector  space 
partitioned)  to  form  children.  These  children  are  further  split  and  this  process 
continues  until  the  desired  rate  is  reached.  In  this  approach,  we  optimize  the  low- 
rate  codebook  first  and  then  find  the  high-rate  codebook  that  is  optimal  given  the 
restriction  on  the  low-rate  code.  These  kinds  of  hierarchical  codebooks  are  useful 
when  most  of  the  traffic  is  low-rate  and  high-rate  is  the  exception. 

The  other  approach,  bottom  up,  is  the  exact  opposite.  We  start  with  a  set  of 
leaf  nodes  that  partition  the  space  and  then  we  recursively  merge  nodes  to  form 
a  tree.  Here,  the  performance  of  the  high-rate  code  is  more  efficient  compared  to 
that  of  the  low-rate  code. 

[13]  also  shows  an  approach  where  different  weights  are  given  to  the  rate  and 
distortion  at  different  levels  and  then  the  algorithm  tries  to  minimize  the  sum  of 
weighted  distortions  and  rates.  This  is  a  generalization  of  which  the  top  down  and 
bottom  up  approaches  are  special  cases. 

In  this  thesis,  we  will  be  concerned  with  a  kind  of  top  down  approach  called 
greedy  growing.  Greedy  algorithms  are  one  of  the  most  commonly  used  methods 
in  hierarchical  codebook  design.  They  are  simple  to  implement  and  provide  suffi¬ 
ciently  good  performance  in  many  cases.  The  next  subsection  gives  the  details  of 
the  greedy  growing  algorithm. 


16 


2.3.4  Definitions 


Clusters  in  a  space  can  be  regarded  as  a  partitio7i  of  the  space.  In  the  rest  of  this 
thesis,  we  will  assume  that  each  partition  of  a  set  T>  C  is  characterized  by  a 
set  of  centroids  9 1,62, ...  ,9k  G  Md.  The  partition  itself  is  created  as  the  union  of 
k  cells ,  Vi,  i  =  1,2, ...  ,k.  Cell  V,  corresponds  to  centroid  0;.  To  derive  the  cells 
from  the  centroids,  we  need  to  fix  a  distance  criterion  p(x,  y )  between  any  two 
points  x,  y  G  Wl.  In  the  remainder  of  this  thesis,  we  will  fix  p(x,  y )  as  the  squared 
Euclidian  distance  p(x,y)  =  ||x  —  y ||2. 

Given  this  distance,  we  derive  the  cells  V,  as  follows 

V;  =  {x  :  p(x,  Qi)  <  p(x,  9j),  j  =  1,  2, . . . ,  k,  j  ±  i] 

Note  that  each  cell  is  associated  with  a  cluster  and  all  points  belonging  to  that 
cell  belongs  to  the  corresponding  cluster.  Thus,  given  a  new  vector  x ,  we  assign 
it  the  cluster  corresponding  to  the  cell  it  belongs  to.  This  is  done  by  Ending  the 
centroids  6,,  among  9i,...,9k  that  is  closest  to  it  in  the  distance  metric  p(x,6). 
Denote  by  9X  this  particular  centroid  that  is  closest  to  x  and  Vx  the  corresponding 
cell,  then  x  is  said  to  belong  to  the  cluster  associated  with  cell  Vx. 

For  the  case  of  unsupervised  clustering,  the  distance  measure  also  doubles  as 
a  distortion  measure.  p(x,  y)  not  only  measures  the  distance  between  x  and  y, 
but  also  gives  the  error  made  when  x  is  represented  by  y.  A  valid  criterion  for 
choosing  one  particular  partition  out  of  the  infinite  possibilities  is  to  minimize  the 
expected  distortion  E{p(X,9x)}-  Here  the  expectation  is  computed  with  respect 
to  the  probability  density  p(x)  on  X.  There  are  other  ways  of  defining  criteria  for 
unsupervised  clustering  [37]. 

Let  9  =  9*  =  c(V,p )  attain  the  minimum  for  D(V,9)  =  E{p(X,9)\X  G  V} 
given  the  probability  density  p.  If  p(., .)  is  the  Euclidian  distance  metric,  this  point 
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is  unique  and  belongs  to  the  convex  hull  of  V.  We  call  this  9*  as  the  generalized 
centroid  of  V  and  denote  the  minimum  value  by  D*(V)  =  D(V,9*).  In  what 
follows  we  will  not  explicitly  mention  the  probability  density  p  unless  there  is  any 
ambiguity. 

Now  assume  that  we  have  two  centroids  9\,92  and  correspondingly,  two  cells 
Vi,  V2  such  that  Vi  U  V2  =  V.  The  resultant  distortion  dne  to  this  partition  of  V 
is  D(Vi,9i)  +  D ( V2 5  $2) •  For  fixed  Vi,  V2,  this  quantity  will  be  the  minimum  and 
equal  to  ZT(Vi)  +  D*(V 2)  if  9 1  and  92  are  equal  to  the  generalized  centroids  of  Vi 
and  V2  respectively.  We  denote  the  decrease  in  distortion 

A d{v,  Vi,  v2)  =  ir(v)  -  (/;*(>’,)  +  zr (v2)) 

There  will  be  at-least  one  partition  {V*,  V2 }  which  is  optimal  in  the  sense  that 
for  any  other  partition  the  decrease  in  distortion  is  equal  to  or  greater  than  that 
achieved  by  {V{,V2}. 

2.3.5  Greedy  tree  growing 

The  greedy  tree  growing  algorithm  starts  out  with  a  root  node  that  is  associated 
with  a  cell  T>  C  Wl.  Then,  recursively,  all  leaf-nodes  are  examined  to  find  the  one 
that  will  give  the  biggest  reduction  in  distortion  when  split.  This  incrementally 
optimal  node  is  split  and  the  process  is  repeated  until  a  desired  rate  is  reached. 

We  have  a  splitting  algorithm  -0  that  splits  the  cell  corresponding  to  any  given 
node  on  the  tree.  For  a  node  on  the  tree  and  the  corresponding  cell  V,  0  will 
compute  a  partition  {Vi,  V2}  of  V  that  gives  a  decrease  in  distortion  A D(V,  Vi,  V2) 
which  is  close  to  the  optimal. 

Given  this  the  algorithm  is  as  follows: 

1.  Fix  a  splitting  algorithm  0 
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2.  Initialize  the  tree,  T  =  c(V,p) 


3.  For  any  leaf  node  tj  belonging  to  the  set  of  leaf  nodes  T  of  the  tree  with 
corresponding  cell  U ,  use  ^  to  find  a  partition  {U \ ,  U2 }  • 

4.  Compute  AD(U,  U\ ,  U2) 

5.  Find  the  leaf  node  tj*  with  the  maximum  value  for  AD{U,IAi,IA-2) 

6.  Implement  the  split  for  tj* 

7.  If  stopping  criterion  is  reached,  stop.  Else,  repeat  3-6 

The  stopping  criterion  can  be  of  several  types.  Most  common  ones  are  con¬ 
straint  on  rate,  number  of  leaf  nodes,  maximum  value  of  distortion,  number  of 
iterations  and  so  on. 

In  [34]  it  is  shown  that  if  the  growth  of  the  tree  is  continued  indefinitely,  the 
distortion  of  the  tree  goes  to  zero  while  [35]  gives  conditions  under  which  the 
algorithm  for  a  tree  grown  with  a  stopping  criterion  on  rate  eventually  terminates. 

2.3.6  Tree  structured  classifiers 

Tree  structured  classifiers  are  similar  to  the  “if-then-else”  rules  in  case-based  rea¬ 
soning.  We  have  an  observation  vector  consisting  of  features  that  are  indicative 
of  the  class  of  a  source.  These  features  can  be  real-valued,  discrete  or  categorical. 
Classification  takes  place  in  a  step  by  step  manner  where  at  each  node  in  the  tree 
one  or  a  combination  of  more  than  one  features  are  used  to  classify  the  observation 
vector  into  one  of  the  child  nodes.  This  is  repeated  until  a  leaf  node  is  reached 
and  a  class  label  is  assigned  to  the  observation  depending  on  the  node. 
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Another  way  of  interpreting  tree  structured  classifiers  is  that  each  node  divides 
the  input  vector  space  into  as  many  disjoint  regions  as  it  has  child  nodes.  Thus 
the  input  space  is  hierarchically  partitioned  into  disjoint  sets;  each  set  has  a  class 
label  associated  with  it  and  any  observation  vector  falling  in  it  is  classified  with 
that  label. 

Construction  of  tree  structured  classifiers  can  be  done  in  the  greedy  fashion 
that  we  presented  for  tree  structured  quantizers.  We  start  with  a  root  node  that 
contains  all  of  the  input  space.  Then,  recursively,  each  leaf  node  is  examined  to 
find  the  one  that  gives  the  biggest  decrease  in  classification  cost  if  split.  This 
optimal  node  is  then  split  to  produce  the  improved  tree.  This  process  is  repeated 
until  the  stopping  criterion  is  satisfied. 

Some  classification  costs  that  have  been  used  for  comparing  classifiers  are  the 
Bayes  risk,  the  entropy,  the  Gini  index  and  the  generalized  Neyman-Pearson  cost 
which  is  a  weighting  of  the  probabilities  of  false  positives  and  false  negatives  [9],  [26]. 
Some  common  stopping  criterion  are  expected  depth  of  search  and  average  classi¬ 
fication  error. 

In  [8] ,  the  authors  present  several  algorithms  for  classification  of  data  that  are 
either  numerical  or  categorical.  They  also  tackle  problems  where  the  dimensional¬ 
ity  of  the  observation  vectors  is  not  fixed. 

The  problem  with  creating  classifiers  by  successive  partitioning  is  that  at  each 
node  we  have  to  find  a  split  that  is  optimal  in  reducing  the  classification  error. 
With  N  elements  in  a  node  there  are  a  possible  2N  possible  partitions.  Searching 
through  all  possible  partitions  becomes  impossible  even  for  small  amounts  of  data. 
In  [9],  the  authors  derive  optimal  partitioning  rules  for  creating  tree  structured 
classifiers.  They  show  that  under  a  wide  variety  of  conditions,  it  is  possible  to  find 
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the  optimal  partition  in  O(N)  time.  For  any  “impurity”  measure  for  the  node, 
they  show  that  minimization  of  the  impurity  is  equivalent  to  the  nearest  neighbor 
condition  for  a  specific  distance. 
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Chapter  3 


Signal  space  and  multi-resolution 
analysis 


A  multi-resolution  analysis  (MRA)  is  a  way  of  obtaining  representations  of  a  signal 
in  increasing  levels  of  resolution.  MRAs  can  be  obtained  in  various  ways.  A 
truncated  Fourier  series  is  one  simple  example.  Affine  wavelet  transforms  and 
wavelet  packets  offer  powerful  and  flexible  procedures  for  obtaining  MRAs.  An 
interesting  circumstance  where  signals  are  analyzed  in  multiple  resolutions  is  in 
biological  systems  ([1],  [47]),  as  mentioned  before.  Before  we  give  a  definition  of 
an  MRA  and  provide  examples,  we  must  define  the  space  of  signals  that  we  are 
considering. 

3.1  Signal  space 

Consider  functions  x(t)  :  [0, 1)  — >  R  defined  on  the  interval  r  =  [0, 1)  with  a  norm 
defined  as 

| |x|  j2  =  f  x2(t)  dt 

Jo 
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We  will  be  dealing  with  the  space  of  all  functions  of  finite  norm 

j0'2 (t )  =  {x(t),t  G  [0, 1)  :  |  |x|  |2  =  f  x2(t)dt  <  00} 

Jo 

Since  our  distance  metric  between  two  elements  of  £2(t)  is  the  norm  of  the 
difference  ||xi  —  a;2j|2,  we  will  actually  be  working  in  the  equivalence  class  of  el¬ 
ements  that  are  equal  in  the  mean  square  norm  sense.  This  means  that  we  will 
not  discriminate  between  two  elements  X\  and  x^.  if  |  \x\  —  x2|  |2  =  0.  For  example, 
elements  that  are  equal  almost  everywhere,  but  not  point-wise  equal,  will  fall  into 
the  same  equivalence  class. 

3.2  Definition  of  MRA 

Given  this  signal  space,  we  define  an  MRA  as  a  sequence  of  sets  S,  C  £2(t)  such 
that 

1.  S0  C  S2  C  53  C  . . . 

2.  U“0^  =  A(r) 

3.  x(t )  =  0  G  So 

4.  Each  Sj  is  spanned  by  a  set  of  di  basis  functions  <f>\(t),  k  =  1,  2, . . . ,  d*  such 
that  any  x(t)  G  S,;  can  be  written  as 

di 

x(t)  =  5 ^ck(x)<j)l(t ) 
k= 1 

for  a  unique  set  of  weights  Cfc(x). 

For  any  x(t)  G  £2(t)  we  denote  the  projection  onto  the  space  S,  by  S^ix(t)  G  S,: 
where 

S^ixit)  =  arg  min  I \x(t)  —  y{t) ||2 
y(t)eSi 
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Since  S^ix{t)  G  Si,  we  can  find  weights  ck{x)  such  that 

di 

y%x{t)  =  Y ck{x)(t)ik 

k= 1 

Note  that  if  x(t)  G  St  then  5?ix(t)  =  x(t)  so  the  notation  for  the  weights  ck(x)  is 
consistent. 

In  what  follows,  we  will  assume  that  x  and  (f)lk  are  functions  of  t  G  C^r) 
and  not  make  it  explicit.  If  we  denote  c(x)  =  [ci(x),  c2(x), . . . ,  c^x)]7^  and  <f>*  = 
[0}, . . . ,  00]T  we  can  write  the  above  as 

y'iX  =  cT(x)d>1 

Note  that  <50  x  is  a  linear  projection  and  is  not  necessarily  an  orthogonal  basis. 

For  a  given  MRA,  denote  by  C,  C  the  set  of  all  possible  weight  vectors;  i.e. 
Ci  =  (c(x)  :  x  G  d>,}.  Then  ,50  is  a  one-to-one  mapping  from  the  set  of  functions 
Si  to  the  set  of  weights  Cj.  For  the  norm  ||x||  for  any  x  G  S,  we  can  write 

||x||2  =  cT  (x)Ric(x) 

where  Ri  is  a  di  x  di  matrix  whose  (■ m,n)th  element  is  given  by 

c,n=  C  vnmn{t)  dt  = 

Jo 

Since  Ri  is  a  symmetric,  positive-definite  matrix  we  can  find  a  non-singular  matrix 
Wt  such  that  Ri  =  WjWi.  This  implies  that 

|  Wl2  =  ||IFic(x)||-> 

where  the  latter  norm  is  the  ordinary  squared  norm  in  a  di  dimensional  space. 
Similarly,  the  distance  between  the  projections  onto  S,  of  any  two  elements  Xi(t) 
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and  X2 (t)  belonging  to  £2^ )  can  be  computed  as  1  —  2^2 1|2  =  ||Wic(xi)  — 
W/ic(x2)||2  We  will  denote 

Xi  =  Wjc(x ) 

Thus  for  all  x  G  Si,  ||a:j|2  =  ||xj|||. 


3.3  Wavelets 

Since  Fourier  first  showed  that  signals  could  be  decomposed  into  projections  along 
orthogonal  basis  functions,  harmonic  analysis  has  come  a  long  way.  Windowed 
Fourier  transforms  were  studied  by  Gabor  [20]  while  the  first  wavelet  finds  mention 
in  the  thesis  of  Haar  [22] .  Mallat  discovered  several  relationships  between  quadra¬ 
ture  mirror  filters,  pyramid  algorithms  and  orthonormal  wavelet  bases.  Meyer  [33] 
constructed  the  first  continuously  differentiable  wavelets  while  Daubechies  [12]  con¬ 
structed  a  class  of  wavelets  with  compact  support  with  arbitrarily  high  regularity. 

In  wavelet  analysis,  a  signal  x(t)  at  resolution  i  is  decomposed  into  a  weighted 
sum  of  a  series  of  functions 

OO 

x(t)= 

k=— 00 

The  most  important  fact  is  that  for  all  i ,  k,  (j>\(t)  is  derived  from  the  same  scaling 
function  <f>(t)  by  dilation-s  and  translations. 

4>k(t)  =  0(2 't  -  k) 

might  not  be  orthogonal  across  scale  i  or  translation  k.  We  will  discuss  the 
cases  of  orthogonal,  bi-orthogonal  and  semi-orthogonal  basis  functions  later  in  this 
section. 
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The  set  of  all  basis  functions  <plk,  i  =  . . . ,  —2,  —1,  0, 1,  2, . . .  for  a  given  i  span  a 
space  denoted  by  St.  We  have 

•  •  •  C  C  5_i  C  S0  C  cSi  C  . . . 

The  fact  that  S;  C  <Sj+1  implies  that  there  exists  weights  a3  such  that 

OO 

(3.1) 

j=- OO 

This  is  true  of  any  MRA  (not  necessarily  wavelet  generated)  and  gives  a  recursion 
from  which  it  is  frequently  possible  to  reconstruct  (pit),  for  a  given  set  of  coefficients 

dj. 

Denote  by  VV,  the  space  of  all  signals  that  complement  St  to  make  np  <Si+1  so 
that 

•S  ,  ©  W,;  = 

where  for  any  two  sets  U,W,  we  denote  U  ©  V  =  {u  +  v  :  u  G  U,v  G  V}. 
In  addition  to  the  scaling  function  (p(t)  we  also  have  a  function  ip[t)  called  the 
wavelet  whose  dilations  and  translations  form  a  basis  on  the  space  VV,.  Since  any 
function  belonging  to  VV,;  also  belongs  to  Sl+ 1 ,  we  have  weights  Wj  such  that 

OO 

i’kif)  =  w^)  (3-2) 

j=~  OO 

Note  the  similarity  to  3.1. 

Much  work  has  been  done  on  constructing  wavelet  bases  that  have  orthogonal¬ 
ity,  compact  support  and  symmetry.  Orthogonality  makes  it  easy  to  decompose  a 
signal  into  a  weighted  sum  of  basis  functions.  Compact  support  is  desirable  be¬ 
cause  it  results  in  finite  impulse  response  (FIR)  filters  for  the  decomposition  rather 
than  infinite  impulse  response  (HR)  filters.  FIR  filters  are  easier  to  implement  us¬ 
ing  common  signal  processing  circuits  than  HR  filters.  Finally,  symmetry  of  the 
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basis  functions  results  in  linear  phase  filters  that  are  necessary  to  avoid  undesirable 
phase  distortions. 

Unfortunately,  these  three  attributes  are  mutually  antagonistic.  Compactly 
supported,  orthogonal  wavelet  functions  are  not  symmetric.  Bi-orthogonal  and 
semi-orthogonal  wavelets  are  a  compromise  that  offer  compactness  and  symmetry 
in  exchange  for  some  complexity  in  the  decomposition  procedure. 

3.3.1  Semi-orthogonal  wavelets 

Semi-orthogonal  wavelets  differ  from  orthogonal  wavelets  in  that  there  are  two 
scaling  functions  <f  and  phi.  <f]  is  not  orthogonal  across  translation  j  but  we  have 

(<&,%)  =  W,k) 

i.e.  <j>]  is  orthogonal  to  all  (flk,  k  ^  j.  Thus  it  is  possible  to  calculate  the  coefficients 
for  the  decomposition 

OO 

x(t)  = 

k= — oo 

for  x  G  Si  as 

Ci,k  =  (x,  4>l) 

Another  property  of  semi-orthogonal  wavelets  is  that  the  space  spanned  by 
<f],j  =  1,  2, . . .  is  the  same  as  the  space  spanned  by  cf] ,  j  =  1,2,...  i.e.  St.  d*  is 
called  the  dual  scaling  function.  We  also  have  a  dual  wavelet  such  that 

=  8(j,k) 

which  spans  the  space  W/.  if]  and  if]  are  called  the  synthesis  and  analysis  wavelet 
respectively. 

Semi-orthogonal  MRAs  offer  compactly  supported,  symmetric  synthesis  wavelets 
but  the  analysis  wavelets  are  not  compactly  supported. 
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3.3.2  Bi-orthogonal  wavelets 

Like  semi-orthogonal  MRAs,  bi-orthogonal  wavelets  have  dual  wavelets  and  scaling 
functions.  But  the  difference  here  is  that  the  dual  scaling  functions  span  a  subspace 
Si  ^  Si  and  the  dual  wavelets  span  a  subspace  W;  ^  W.  Furthermore  we  do  not 
have  Si  orthogonal  to  VV,  as  we  had  in  the  case  of  orthogonal  and  semi-orthogonal 
MRAs. 

Bi-orthogonal  MRAs  offer  compactly  supported,  symmetric  analysis  and  syn¬ 
thesis  wavelets  and  scaling  functions;  something  that  neither  orthogonal  nor  semi- 
orthogonal  wavelets  can  provide. 

For  more  details  on  the  construction  and  use  of  orthogonal,  semi-orthogonal 
and  bi-orthogonal  wavelets  see  [21]. 

3.4  Biological  filters 

Hierarchical  processing  of  signals  have  been  observed  in  the  primary  auditory  cor¬ 
tex  (Al)  of  the  mammalian  brain  [47].  The  auditory  cortex  receives  the  input  from 
the  inner  ear  which  computes  a  spectrogram  of  the  sound  that  impinges  on  the 
ear. 

In  the  Al,  the  neurons  are  arranged  in  a  2D  map.  Neurons  are  arranged  in 
order  of  selectivity  to  increasing  frequencies  along  one  axis  of  the  2D  map.  This 
is  the  so  called  tonotopic  axis.  Thus  sounds  of  a  particular  frequency  will  excite 
neurons  around  a  particular  region  on  the  tonotopic  axis. 

Much  research  has  gone  into  determining  what  features  are  presented  along  the 
other  axis  of  this  2D  map.  Researchers  have  identified  three  characteristics  that 
vary  along  the  second  axis.  They  are 


1.  symmetry  of  the  spectrogram, 


2.  bandwidth  of  the  spectrogram  and 

3.  direction  of  FM  sweep. 

Along  the  second  axis,  neurons  exhibit  a  continuous  gradation  in  the  kind  of  sym¬ 
metry  they  are  attuned  to.  Starting  with  neurons  that  are  selective  to  spectrograms 
with  higher  energy  in  frequencies  above  the  Base  Frequency  (BF),  the  selectivity 
grades  to  neurons  that  are  selective  to  symmetric  spectrograms  up  to  neurons  se¬ 
lective  to  spectrograms  with  higher  energy  in  frequencies  below  BF  (See  Fig.  3.1). 


Tonotopic  axis  x 

Figure  3.1:  Features  arranged  along  the  second  axis  in  Al 

The  bandwidth  of  the  spectra  that  the  neurons  are  selective  to,  also  changes 
from  narrow  bandwidth  in  the  center  of  the  axis  to  broad  bandwidths  at  the  ends. 

Thirdly,  neurons  at  one  end  are  selectively  attuned  to  chirps  with  downward 
moving  frequency  and  neurons  at  the  other  end  are  attuned  to  upward  moving 
FM  chirps,  while  neurons  in  the  middle  are  equally  responsive  to  chirps  in  both 
directions. 

Wang  and  Shamma  [47]  propose  a  multi-resolution  signal  processing  scheme 
to  account  for  these  observations.  A  seed  function  h(x)  is  used  to  model  the 
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sensitivity  profile  of  a  neuron.  Given  any  function  h(x),  we  can  find  symmetric 
and  anti-symmetric  functions  he  and  hQ  such  that 

h  =  he  +  hQ 

The  unique  he  and  hQ  are  given  by 

km  = ft(x) 

K(x)  =  =Ji zfi 

Then  we  can  produce  a  function 

ws{x\(j)c)  =  he(x)  cos  (j)c  -  hQ(x)  sin  (f>c 

that  continuously  grades  from  antisymmetric  in  one  direction  to  symmetric  to 
antisymmetric  in  another  direction  when  0C  goes  from  —  n/2  to  7t/2.  This  form 
for  ws  was  chosen  so  that  the  magnitude  of  the  Fourier  transform  of  ws(x  :  4>c) 
is  a  constant  independent  of  <f>c .  The  authors  also  show  that  this  is  effective  in 
accounting  for  FM  selectivity. 

This  takes  care  of  the  variation  in  symmetry  and  FM  selectivity  of  the  response. 
To  model  the  variation  in  bandwidth,  h(x)  is  dilated  according  to  hs(x )  =  h(asx ) 
for  a  fixed  parameter  a  (Usually  a  =  2).  This  gives  us  two  variables;  (f)c  which 
models  the  symmetry  and  s  which  models  the  scale.  Thus,  in  this  model  of  the 
primary  auditory  cortex,  the  spectrum  of  the  input  sound  is  analyzed  along  three 
axes,  the  center  frequency  x,  the  symmetry  0C  and  scale  (or  bandwidth)  s.  Fig  3.2 
shows  this  pictorially. 

In  Chapter  6  we  will  present  an  application  of  this  model  when  we  use  it  to 
extract  multi-resolution  features  that  will  help  us  estimate  the  wear  on  a  milling 
tool  from  its  acoustic  emissions. 
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Figure  3.2:  Axes  of  analysis  of  spectrum  in  Al 

3.5  Probability  densities  on  ^(r) 

We  assume  that  we  have  a  probability  density  P  on  £2  (t).  One  interpretation  of  a 
probability  density  on  an  infinite  dimensional  space  is  that  the  signals  x(t)  are  the 
outputs  of  a  stochastic  process.  Given  this  density,  we  can  also  find  the  density  in 
the  di  dimensional  space  of  ith  resolution  representations  of  all  x  e  £2(t),  i.e.  Si. 

Another  assumption  for  the  remainder  of  this  thesis  is  that  M00(P)  =  EP{\  |x|  |2)  < 
oo.  In  other  words,  the  expected  power  of  the  signal  must  be  finite.  The  fact  that 
S^iX  is  the  projection  of  x  onto  the  space  S,  implies  that  ,$^tx  is  orthogonal  to  the 
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error  —  x)  which  gives 


Mt(p)  =  E{\\yiX\\2} 

<  e{\\x - ylX\\2}  +  E{\\yiX\\2} 

=  e{\\x  —  yiX\\2}  +  E{\\yiX\\2}  +  ((x  —  y'iX),y,iX) 

=  e{\\x  —  y*iX  +  s^iX\\2} 

=  E{\\X\\2}  =  M^P)  <oo 

Mt(P)  <  oo  for  all  i  means  that  for  any  V  C  5,;,  the  problem  of  finding  the  centroid 
that  minimizes  the  distortion  is  well  defined  and  has  at  least  one  solution. 

The  actual  multi-resolution  analysis  done  on  the  signals  plays  a  very  important 
role  in  good  classifier  and  quantizer  performance.  It  must  be  capable  of  picking 
out  features  that  are  most  significant  to  the  performance,  in  the  coarse  level.  Some 
algorithms  for  selecting  an  MRA  for  a  specific  problem  are  given  in  [10],  [38]  and 
[49], 

3.6  Splitting  to  reduce  distortion 

Given  a  probability  density  p  on  C2  (t),  we  can  calculate  the  distortion  due  to  all 
the  points  in  a  cell  V  C  C2{r)  represented  by  point  6  G  V  as 

D(V,6)  =  [  ||x-0|| 2dP  (3.3) 

Jv 
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If  9  £  Sk,  i.e.  9  can  be  expressed  as  a  linear  combination  of  the  dk  basis  functions 
of  Sk,  then  we  can  express  3.3  as 


D(V,6)  =  /  \\x-ykx  +  ykx-9\\2dP 

Jv 


\x  -  ykx\\2  + 1|  ykx  -  djj2  -  2((x  -  ykx),  (ykx  -  o))dP 


\x-ykx\\ 2dP+  /  \\ykx-e\\2dp 


(3.4) 


JV  JV 

The  last  equality  follows  since  (d7'kx  —  9)  is  a  linear  combination  of  the  orthonormal 
basis  functions  of  Sk  to  which  (x  —  S^kx)  is  orthogonal. 

Denote  Dk(V,9)  =  fv\\J^kx  —  9\\2dP  and  Dk(V)  =  jv\\Sfikx  —  x\\2dP  = 
Ep{\\S^kX  —  X\ | J}.  Dk(V .  9)  is  the  component  of  the  distortion  that  changes  with 
9.  Dk(V)  is  the  orthogonal  component  of  the  distortion  that  is  independent  of  9 
and  depends  only  on  the  resolution  k.  For  any  V,  Dk(V)  — >  0  as  k  — >  oo. 

The  above  shows  that  to  find  a  /e-resolution  centroid  9  to  minimize  distortion, 
we  need  only  look  at  the  dk  dimensional  space  of  ^-resolution  representations  of 
all  signals  x  G  V.  In  other  words, 


arg  min  D(V,  9)  =  arg  min  Dk(V,  9) 


(3.5) 


The  minimum  value  that  D(V,  9)  can  take  for  any  9  is  Dk(V),  when  Dk(V,  9)  = 


0. 


3.6.1  Projections  of  cells  in  multiple  resolutions 

Let  us  define  how  cells  in  one  resolution  are  carried  over  to  another  resolution.  Let 
and  be  the  representations  at  two  resolutions.  Assume  that  we  have  a  cell 
Vi  in  resolution  i.  Then  the  corresponding  cell  Vj  in  resolution  j  is 

Vj  =  {y'jX  :  y'iX  £  Vi,  x  £  C2 (r)} 


33 


Note  that  if  we  take  a  cell  in  high-res  space  and  project  it  to  a  low-res  space  and 
then  project  it  back  to  the  high-res  space  we  might  not  necessarily  end  up  with 
the  original  cell.  This  happens  because  more  than  one  point  in  the  high  resolution 
space  is  projected  onto  the  same  point  in  the  low  resolution  space. 

Any  convex  cell  Vk  C  5^k  in  the  kth  resolution  is  projected  onto  a  convex  cell 
at  all  resolutions.  To  see  this,  consider  the  projection  of  Vk  onto  jC2(t)  defined  as 
Voo  =  {x  G  c2{r)  :  ykx  G  14}.  For  any  x,  y  G  14o  we  have  ykx,yky  G  Vx.  Then 
the  convexity  of  Vk  gives 

yk{ax  +  (1  —  a)y)  =  a<ykx  +  (1  —  a)yky  G  14,  for  any  a  G  [0, 1] 

This  implies  that  ax  +  (1  —  a)y  €  14o  which  shows  that  Voo  is  convex. 

For  any  i,  Vt  =  {^x  :  Ykx  G  Vk}  =  {^x  :  x  G  Voo}-  For  any  ^%y  G  Vt 
we  have  x,  y  G  Voo  and  ax  +  (1  —  a)y  G  Voo  which  implies  that 

5?i(ax  +  (1  —  a)y)  =  a^x  +  (1  —  a)S?iy  G  V; 

which  implies  that  V,  is  convex.  Thus  convex  cells  in  one  resolution  are  projected 
onto  convex  cells  in  any  other  resolution. 
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Chapter  4 


Compression  using 
Multi- resolution  TSVQ 


As  we  mentioned  earlier,  MRTSVQ  offers  several  advantages  over  unembellished 
TSVQ.  To  do  the  coarse  quantization  in  the  top  layers  of  the  tree  structured 
codebook,  it  is  frequently  adequate  to  use  a  coarse  representation  of  the  signal. 
This  in  turn  enables  us  to  do  distance  calculations  with  far  less  computations  than 
if  the  vectors  were  at  the  highest  resolution.  This  computational  advantage  is  very 
important  if  the  number  of  letters  in  the  codebook  is  very  large.  Furthermore, 
storing  the  centroids  corresponding  to  the  nodes  of  the  tree  requires  less  memory. 

In  this  chapter  we  will  present  the  greedy  growing  algorithm  for  MRTSVQs 
and  establish  some  useful  properties  of  this  algorithm. 

4.1  Preliminaries 

As  briefly  mentioned  in  earlier  chapters,  the  goal  of  compression  is  to  quantize 
a  real  valued,  random  signal  into  a  finite  number  of  indexed  values.  Then  these 
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indices  can  be  transmitted  or  stored  instead  of  the  original  signal.  Let  X  E  C2  (t) 
be  a  random  variable  that  is  distributed  according  to  some  density  p(x)  as  detailed 
in  the  previous  chapter.  Note  that  X  is  not  a  scalar  or  a  vector,  but  a  function 
defined  on  [0, 1). 

We  want  to  find  a  mapping  Q  :  £2(t)  — > ►  {6\ ,62, _ ,0k}  where  0,  E  £2 (t). 

Thus,  given  an  x  E  £2  (r),  Q(x)  will  be  a  quantized  version  of  x.  The  error 
between  x  and  its  quantization  Q(x)  is  given  by  a  distortion  function  p(x,Q(x)). 
p(., .)  is  non-negative  everywhere  and  convex.  Some  examples  of  p  were  given  in 
previous  chapters. 

In  what  follows,  we  will  assume  that  p(x,y )  =  ||x  —  y  ||2  =  J0*  (x  —  y)2  dt.  For 
x,y  E  Si  denote 

di 

x = Y,^x)m = 

3= 1 

y  =  ^2c){yWj{t)  =  c\y)T<&(t) 

3= 1 

Then  we  can  write  \  \x  —  y\\2  =  ||WjC*(x)  —  Wict{y) |||  where  Wi  is  a  non-singular 
matrix  that  depends  on  the  basis  functions  at  resolution  i  and  ||.||2  is  the  ordinary 
squared  norm  of  a  vector.  Thus,  the  distortion  between  two  functions  in  St  is 
equivalent  to  the  squared  error  between  linear  transforms  of  their  MRA  coefficients. 
If  we  denote 


x*  =  Wi?(x) 

f  =  Wic\y )  (4.1) 

then  the  distortion  between  x  and  y  in  the  £2('r)  space  is  equal  to  the  squared 
error  between  xl  and  yl  in  the  space.  The  fact  that  it  is  easier  to  calculate 
the  latter  distortion  compared  to  the  former  is  what  makes  MRTSVQ  much  faster 
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than  TSVQ.  This  also  allows  us  to  extend,  to  C2{r)  results  and  algorithms  that 
apply  to  For  any  C  C  Si,  we  denote  C  =  {xl  :  x  €  C}  C  Mdi.  In  what  follows 
we  will  denote  x  =  xl  whenever  it  is  not  necessary  to  make  explicit  the  resolution 
of  the  decomposition. 

If  we  represent  all  x  G  C  C  St  with  one  representative  6  G  Si,  the  expected 
distortion  is  given  by 

D(C,6)  =Ep{\\X-6\\2  :X  e  C}  =  J\\x~0\\22dx 

Denote  by  9*  the  representative  that  minimizes  this  distortion.  As  mentioned  in 
the  previous  chapter,  we  call  this  the  generalized  centroid  of  C.  With  the  squared 
error  distortion,  the  centroid  is  unique  and  lies  within  the  convex  hull  of  C.  With 
other  distortion  measures,  these  properties  might  not  hold. 

Now  assume  that  we  split  C  using  two  representatives  Q\  and  62 ■  The  LBG 
algorithm  presented  in  Chapter  2  shows  that  the  partition  with  the  least  distortion 
must  be  two  cells  separated  by  a  hyper-plane  given  by 

{x  £  C  :  ||x  —  #i||2  =  ||x  —  #2||2} 

and  the  partition  is  {C\ ,  C2}  =  {C  D  H,  C  fl  Hc}  where 

H  —  {x  :  \\x  —  9i\\2  <  ||x  —  1 1 2}- 

Denote  by  C\  the  cell  corresponding  to  9\  and  C2  the  cell  corresponding  to  62. 
Then  C1UC2  =  C.  Let  (0*,  0%)  be  a  set  of  two  centroids  that  minimize  the  distortion 

D(Ci,  61  +  D(C2,  92)  —  I  \\x  —  6i\\22dx  +  I  \\x  —  92\\ldx 

JC\  JC2 

where  all  x  is  the  transformed  projection  of  the  signal  in  the  current  resolution 
and  C  =  {x  :  x  G  C}. 
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In  contrast  to  the  case  where  we  only  had  one  centroid,  here  there  can  be  more 
than  one  set  of  centroids  (91,  9%)  that  minimize  the  above  distortion.  For  example, 
take  a  probability  distribution  in  C  that  is  radially  symmetric.  Then  any  rotation 
of  an  optimal  centroid-couple  will  given  another  optimal  centroid-couple. 

The  decrease  in  distortion  when  going  from  one  centroid  to  two  is  given  by 

AD(C,Ci,C2)  —  f  \\x  —  9*\\\dx  —  I  ||x  —  9\\\\ dx  —  f  Wx  —  d^Wldx 
Jc  Jci  Jc2 

This  quantity  plays  an  important  role  in  deciding  which  leaf  of  the  MRTSVQ  to 
split. 

Before  proceeding,  we  need  to  present  some  supporting  results  that  will  be  used 
later  in  this  chapter.  For  these  lemmas,  assume  that  the  cell  U  G  St  so  that  we 
can  consider  the  space  U  C  of  transformed  coeffients  as  in  4.1. 

Lemma  1  For  any  cell  U  G  St  and  hyper-plane  H ,  we  have  A D(U,  H)  >  0. 

Proof: 

D*{U)  =  min  /  | \x  —  9\ \2p(x)dx 

=  min(  /  \\x  -  e\\lp(i)di  +  f  ||4-e|||p(i)(ii) 

0  JunH  JunHc 

>  min  /  ||x  —  §\\2p(x)dx  +  min  /  \\x  —  6\\2p(x)dx 

6  JunH  o  JunHc 

=  D*(UFH)+D*(UFHC) 

=y  A D(U,Ui,U2)  =  D*(U)  -  (D*(UFH)  +  D*(UnHc))  >  0 

This  shows  that  any  split  will  improve  or  leave  unchanged  the  total  distortion 
of  a  cell. 

Lemma  2  Assume  a  cell  U  C  Sk  with  an  absolutely  continuous  density  and  non¬ 
zero  probability  that  is  split  into  two  cells  U1M2  by  61,62  belonging  to  the  set  of 
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optimal  centroids.  Then  we  have  AD(14M\M2)  >  0  and  P{U\ )  >  0  and  P{U2)  > 

0. 


Proof:  Denote  by  9p  the  centroid  of  U.  Since  the  probability  density  of 
is  absolutely  continuous,  we  can  find  a  partition  Vi,Vo  such  that  9P  is  not  one  of 
the  centroids  of  V2.  This  can  be  done  by  selecting  V2  such  that  9P  lies  outside  the 
convex  hull  of  V2 .  Let  92  be  a  centroid  of  V2  •  Then  we  have 

A D(U,9)  >  D(U,9p)-D(V1,9p)-D(V2,92) 

=  D(V2,9p)-D(V2,92) 

>  0 

The  last  inequality  holds  because  9p  is  not  a  centroid  of  V2.  Since  we  can  find 
at-least  one  9\  and  92  such  that  AD  >  0,  the  result  holds. 

Without  loss  of  generality,  assume  P{U\ )  =  0.  Then  P{U2 )  =  P{U).  This 
implies  that  the  distortion  contributed  by  the  elements  in  U\  is  zero,  which  shows 
that  the  minimum  distortion  D*(U2)  =  D*(U).  This  gives  AD  =  0  which  is 
contradicted  by  the  previous  result. 

4.2  Greedy  growing  for  MRTSVQ 

The  greedy  growing  algorithm  for  MRTSVQ  is  very  similar  to  that  for  the  ordinary 
TSVQ.  The  only  difference  is  that  here  we  need  a  rule  that  tells  us  whether  to  split 
a  given  node  at  the  current  resolution  or  to  go  to  a  higher  resolution.  One  rule 
used  in  [3]  is  to  go  to  the  next  higher  resolution  when  the  decrease  in  distortion 
obtained  by  splitting  at  the  current  level  is  lesser  than  some  fixed  fraction  of  the 
total  distortion  of  the  node.  There  can  be  many  other  possibilities.  d(L)  will 
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denote  the  dimension  of  the  signal  representation  at  node  f*.  We  will  denote  by 
0  the  rule  used  to  determine  the  resolution  in  which  a  node  at  any  given  depth  is 
split. 

This  rule  takes  into  account  our  tradeoff  between  computational  ease  at  the 
lower  resolution  and  higher  potential  decrease  in  distortion  at  a  higher  resolution. 
The  computational  burden  increases  linearly  with  the  dimension  of  the  vectors 
while  the  potential  decrease  in  distortion  on  going  from  resolution  k  to  k  +  1 
decreases  with  increasing  k  for  large  k.  Thus  a  reasonable  rule  would  try  to  utilize 
as  much  of  the  information  in  the  lower  resolution  as  possible  before  going  onto  a 
higher  resolution.  Also  if  there  is  zero  potential  decrease  in  the  current  resolution, 
this  rule  will  try  to  find  at  least  one  higher  resolution  k  where  the  decrease  in 
distortion  is  greater  than  zero. 

We  also  have  a  splitting  algorithm  -0  that  splits  a  given  node  to  maximize  the 
decrease  in  distortion.  A  reasonable  splitting  algorithm  would  start  with  a  good 
initial  position  for  the  centroids  and  then  use  the  LBG  algorithm  to  converge  to  a 
local  optimum. 

We  have  used  the  notation  0, 0  to  denote  the  rules  that  tell  us  when  to  go  up 
in  resolution  and  how  to  split.  These  should  not  be  confused  with  the  wavelet  and 
scaling  functions  we  presented  in  the  last  chapter  that  were  denoted  by  the  same 
letters.  We  will  denote  trees  by  the  letter  T.  T'  -<T  denotes  that  T'  is  a  subtree 
of  T .  This  means  that  T'  and  T  have  the  same  root  node  and  all  nodes  belonging 
to  T'  also  also  belong  to  T.  T'  -<  T  implies  that  there  is  at-least  one  node  in  T 
that  does  not  belong  to  T' .  We  will  denote  nodes  by  the  letter  t,  leaves  by  t  and 
the  set  of  all  leaves  of  a  tree  T  by  T. 

Given  this,  the  algorithm  for  greedy  growing  is  as  follows: 
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Algorithm  I: 


1.  Fix  a  splitting  algorithm  i/j  and  a  rule  <f>  that  determines  when  to  split  at  a 
higher  resolution. 

2.  Initialize  tree,  T  =  c(po,So),  where  c(po,So)  denotes  the  generalized  centroid 
of  So  C  Md°,  the  zeroth  resolution  space  under  the  corresponding  probability 
density  p 0.  Initialize  iteration  count  i  —  0. 

3.  For  any  leaf  node  tj  belonging  to  the  set  of  leaf  nodes  T  of  the  tree  with 
corresponding  cell  Uj  C  at  resolution  d(tj)  decide  whether  to  go  to  the 
next  higher  resolution  or  to  split  at  the  same  resolution. 

4.  Calculate  A D(^,Uj)  in  the  corresponding  resolution. 

5.  Find  the  leaf  node  tj *  with  the  maximum  value  for  A Uj) 

6.  Implement  the  split  for  tj*. 

7.  If  stopping  criterion  is  reached,  stop.  Else,  increase  i  by  1  and  repeat  steps 
3-7 


4.3  Properties  of  the  algorithm 

4.4  Property  1:  Vanishing  distortion 

The  following  theorem  shows  that  under  a  condition  on  the  rule  that  decides  which 
resolution  to  split  in,  the  distortion  of  the  tree  can  be  made  arbitrarily  small  by 
implementing  the  greedy  growing  algorithm. 
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Theorem  1  Let  Alg(p ,  kn)  denote  the  set  of  all  trees  produced  by  the  greedy  grow¬ 
ing  algorithm  in  kn  iterations  and  Tn  e  Alg(p ,  kn).  Then,  if 

d\tm)  _ ^ 

log  (depth(tm)) 

for  the  sequence  of  nodes  {tm}  on  any  branch,  we  have  D(Tn )  — >  0 

Before  we  prove  this  theorem,  we  will  need  several  preliminary  results. 

4.5  Preliminaries 

Denote  by  D[T)  the  average  distortion  of  a  tree  structured  VQ.  We  assume  that 
the  probability  density  p  is  fixed  and  will  not  explicitly  mention  it. 

Lemma  3  For  any  two  trees  T,  T'  such  that  T'  -<T 

D(T ")  >  D(T) 

Proof:  The  proof  follows  easily  from  Lemma  1  and  the  fact  that  T  can  be  obtained 
from  T'  by  splitting  at-least  one  node 

Lemma  4  For  any  fixed  splitting  rule  if, 

^2  A-D  hi]  Mt)  <  Moo  (p) 

ter 

where  U] ,  Uf  is  the  partition  of  ofUt  given  by  if. 

Proof:  If  we  start  with  a  tree  of  a  single  node  at  resolution  0  with  a  distortion 
D*(S0)  and  successively  split  leaf  nodes  to  expand  the  tree,  each  split  of  a  node  t 
will  decrease  the  original  distortion  by  A D{Ut,U} ,Uf).  Since  the  total  distortion 
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of  the  tree  always  remains  non-negative  we  get 

D*(So)-J2AD(Ut,U?,U?)>0 

teT 

=*  ^  A D{UuUlMt)  <  D*(S0)  <  M^P)  (4.2) 

ter 

The  last  inequality  comes  from  the  optimality  of  D*(So)  and  the  fact  that  the 
function  x(t)  =  0  :  f  6  [0, 1)  belongs  to  <S0- 

When  the  greedy  growing  algorithm  is  applied  n  times  to  vectors  from  a  fixed 
probability  density  p,  we  get  a  sequence  of  trees  T0  -<T\  Tn>.  We  call  this 

a  trajectory.  If  n'  =  n  we  call  it  a  complete  trajectory.  Incomplete  trajectories  can 
form  if  AD(Uf  M~  ,U~)  =  0  for  all  t  E  Tn  in  all  resolutions. 


Lemma  5  [3f]  IfT  has  a  complete  trajectory  of  the  form  T0  -<  Ti  -<...-<  Tn  —  T, 
then 

min  ^  A D(UhU},Ut)<^rY 

1  <3<n  win) 

tZTj-i 

where  w{n)  =  ^'l=1  j~l 


Proof:  For  j  =  1 ,n  let  t*_x  G  be  the  terminal  node  that  is  split  to 
form  Tj.  Since  has  j  such  nodes,  the  greedy  node  selection  criterion  ensures 
that 


r1 


Therefore 


w(n 


min  ^  d.D(Ut,Ul,Uf)  <  Atm.  , U],  M.  ) 

l<j<n  L '  3  J-1  j  —  t 


ter,.! 


teT 


which  gives  the  necessary  result. 

The  following  lemma  is  a  geometric  result  that  will  be  used  later. 
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Lemma  6  Consider  two  concentric  balls  B i  =  B(0,  r)  and  B2  =  B(0,rVd)  in 
a  d-dimensional  space.  Now  take  a  hyper-plane  in  this  space  that  is  tangent  to 
the  inner  ball  B 1  and  intersects  the  outer  ball  in  a  hyper-circle,  thus  dividing  the 
surface  of  the  outer  ball  into  two  parts,  Vi  and  V2  with  Vol(V i)  <  V ol(V2)  (See 
figure  in  Appendix).  Then,  we  have 

voiiy i)  i  w, 

Vd{Vi)  +  Vol(y2)  2V2ff~e 

Proof:  Given  in  the  Appendix. 

Lemma  7  If  ip  is  an  optimal  splitting  rule,  then  for  every  setU  C  Sk  for  some  k, 
every  distribution  p  and  every  number  f3  >  0 

62 

A  A  D(U,  UfU1)  >  - _  P(x  eU:\ \Ykx  -  c(U ,  k,p)\  \  >  (3)  (4.3) 

2V27T  ed 

Proof:  Denote  by  x  the  k-t\i  resolution  representation  of  x  in  Mdfe.We  will  not 
make  the  resolution  k  explicit  as  long  as  there  is  no  confusion.  As  mentioned  in 
the  previous  chapter,  | \5Akx  —  S^ky\ |2  =  ||£  —  y\\2  for  any  x,y  G  Also  denote 

by  c  the  k- th  resolution  representation  of  c(U,  k,p).  Then 

P(x  e  U  :  | \y,kx  —  c(U ,  k,p)  ||  >  P)  =  P(x  eU  :  \\x  -  c\  \  >  P) 

Consider  two  balls  B\  =  B{c,P / y/df) ,B2  =  B(c,P)  around  c  with  radius  P / yfdf 
and  p.  Then,  from  the  previous  lemma,  any  hyper-plane  H  tangential  to  B\  will 
have  at- least  l/(2v27re)  fraction  of  the  surface  area  of  B2  on  the  side  not  containing 
the  common  center.  This,  in  turn,  means  that  there  will  be  a  hyper-plane  H*  such 
that 

P(HnBP)  >  I _ P(x  e  U  :  I  lx  —  ell  >  p) 

2y/27re 
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Let  v*  be  the  vector  in  H  that  is  closest  to  c.  Then  (c  —  v*Y(x  —  v*)  <  0  holds 
for  every  x  G  H,  which  gives 


I  I  '-'II  I  I  ‘'11—11'-'  ‘'II  7 

If  c1  is  the  centroid  of  V  =  U  H  and  c2  is  the  centroid  of  W  —  Uf' \  Hc,  then 

A D(U,V,W)  =  D(U,  c)  —  D(V,  c1)  —  D(W,  c2) 

=  Dk{U ,  c)  +  i?(W,  /c)  -  tDa.(V,  c1)  -  R(V,  k )  -  Dk(W,  c 2) 
k ) 

=  Dk(U,  c)  -  Dk(V,  c1)  -  Dk(W ,  c2) 


=  (Dk(V,  c)  -  Dk(V,  c1))  +  (Z>fc(W,  c)  -  Dk(W,  c2)) 

>  ^(V.cJ-D^c1) 

>  Dk(V,c)-Dk(V,v*) 

=  /  (||a;  —  c|| 2  —  ||a:  —  v*\\2)dp 

Jv 

02 

>  ^PiV) 

32 

>  — .  p(x  G  U  :  I  \x  —  cl  I  >  3) 

~  2^2 Ted 

which  gives  the  required  result. 

Denote  by  T(x)  the  function  that  maps  x  G  Ut  to  c(Ut,  kt,p)  where  Ut  C  L2(r) 
is  the  cell  associated  with  a  leaf  node  1  G  T  and  c(Ut,kt,p )  is  the  centroid  of  Ut 
when  it  is  split  in  the  kt- th  resolution.  Further  denote  by  k(x)  the  resolution  kt  of 
the  cell  to  which  x  belongs  to  and  by  c(x)  the  centroid  c(Ut,  h,p).  Then  we  have 
the  following  lemma 

Lemma  8  There  exists  constants  /3, 7  >  0  depending  only  on  5  and  p  such  that 
for  any  tree  T  with  D(T )  >  5,  created  after  a  sufficiently  large  number  of  iterations 
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of  the  algorithm, 


P(x  :  \\yk(x)  ~  c(z)||  >  p)  >  7 

Proof:  Assume  that  there  is  a  number  K  <  oo  such  that  P{x  :  ||a;||  <  K\  =  1. 
Then,  for  any  x,y,  \\x  —  y ||2  <  4A'2.  If  D(T)  >  6  for  any  T,  we  have 

*  -  S^x~nx)fdp 

<  §  +  U<2P{x:  ||x-T(x)||  >  y|} 

which  gives 

P{x  :  ||x  -  T(x)  ||  >  y|}  > 

Now  for  any  x  G  A2(r)  integer  k  and  c  G  ||a:— c||2  =  \\S^kX— c||2  +  ||a;— ^a;||2. 
Lemma  9  shows  that  the  depth  of  the  leaf  node  with  the  minimum  depth  keeps  on 
increasing  as  the  number  of  iterations  increases.  This  implies  that  the  resolution 
of  the  lowest  resolution  leaf  node  increases  indefinitely  and  we  can  always  find  an 
iteration  number  such  that  for  all  leaf  nodes  of  the  tree 

\\x  -  yk(x)x\\2  <  e2  Va: 

for  some  e  >  0  such  that  e  <  /3. 

This  implies  that  for  all  leaf  nodes  in  such  a  tree 

II*  -  c(x) II2  >  p2  =►  \\yk{x)x  -  c(x) f  >  p2  -  e2 

which  gives 

P{x  :  || Sk(x)  ~  c(^)||2  >  /32  -  e2}  >  P{x  :  ||x  -  c(z)||  >  /?}  >  7 

providing  the  desired  result.  The  more  general  case  where  p  does  not  have  compact 
support  can  be  derived  from  the  result  given  in  the  appendix  of  [34] . 
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Proof  of  Theorem  1:  If  the  algorithm  stops  before  all  iterations  have  been 
finished,  then  it  means  that  A D}-(UtMt  i^t)  =  0  f°r  all  leaf  nodes  t  e  T  and  all 
resolutions.  This  implies  either  that  the  probability  P{Uf)  is  zero  for  all  leaf  nodes 
or  that  all  the  probability  is  concentrated  at  a  point.  In  either  case,  the  distortion 
of  the  tree  is  zero. 

Now  consider  the  case  where  the  algorithm  does  not  terminate  until  all  itera¬ 
tions  have  been  finished.  For  a  large  enough  iteration  number  n,  consider  all  trees 
thus  formed  which  have  a  distortion  D(T)  >  6.  By  Lemma  5,  there  exists  a  T'  <T 
such  that 

5]  A D(U,M},U})  <  (4,4) 

win) 

t'eT1 

Since  T'  is  a  subtree  of  T,  it  follows  from  Lemma  3  that  D(T')  >  5,  and  thus  there 
exists  constants  f3, 7  >  0  depending  only  on  6  and  p  such  that 


P{x  :  \\yk(x)X  -  T'(x) II  >  13}  >  7 


(4.5) 


when  n  is  large.  Then,  the  optimality  of  if  and  Lemma  7  imply  that 

E  &D(u*,i4,i4)  >  ? 

Z  V  z 

t'eT' 


2\/2  7:edr 


-P{x  :  ||a;  —  T/(a;)||  >  (3}  >  rj 


(4.6) 


where  rj  =  7/32/2v/2vredmax  >  0  and  dmax  is  the  dimension  of  the  MR.A  at  the  leaf 
node  with  the  highest  resolution.  The  above  implies  that 


71 J  (  \  dmax  ^  ^(3 ~  n 

>  — =  >  0 


win) 


(4.7) 


2V27re 

If  dmax / w  (n)  — >  0  as  n  increases,  the  left  hand  side  goes  to  zero,  while  the  right 
hand  side  is  positive  and  independent  of  n.  This  shows  that  D(T)  >  6  for  only  a 
finite  number  of  trees.  Since  6  was  arbitrary,  this  gives  the  necessary  result. 

Note  that  ln(n)  <  w(n)  <  ln(n)  +  1,  so  the  condition  above  can  also  be  written 
as  dmax/  In (n)  — >  0. 
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4.6  Property  2:  Termination  with  rate  constraint 


In  the  previous  chapter  we  mentioned  a  stopping  criterion  for  the  greedy  growing 
algorithm.  The  growth  of  the  tree  is  stopped  when  this  criterion  is  satisfied.  For 
quantizers,  a  very  common  stopping  criterion  is  in  the  form  of  a  rate  constraint.  We 
keep  on  growing  the  tree  as  long  as  the  expected  bit-length  of  the  codebook  is  lesser 
than  a  given  R  and  stop  as  soon  as  it  gets  larger  than  R.  This  is  reasonable  when 
we  have  a  channel  with  a  fixed  capacity  and  we  want  to  constrain  the  quantizer  to 
have  a  rate  lesser  than  or  equal  to  the  channel  capacity. 

Here  we  will  show  that  for  a  rate  constraint,  the  algorithm  given  in  the  previous 
chapter  will  terminate  after  a  finite  number  of  iterations.  This  is  important  as  it 
ensures  that  only  a  finite  time  is  required  to  create  a  quantizer  in  practice. 

Theorem  2  For  a  stopping  criterion  of  the  form  r  <  R,  the  algorithm  Alg.  1 
terminates  after  a  finite  number  of  iterations  i 

The  proof  of  this  result  uses  the  following  lemma.  By  definition,  a  balanced  binary 
tree  is  a  binary  tree  where  all  leaf  nodes  are  at  the  same  depth  and  all  nodes  that 
are  not  leaf  nodes  have  both  their  children  present. 

Lemma  9  Let  T\  -<  T2  -<  . . .  be  the  sequence  of  trees  created  by  each  iteration  of 
Alg.  1.  Then  there  is  a  sequence  of  balanced  trees  Ti,  T2,  ■  ■  ■  such  that  T k  ■<  Tk  and 
Depth(Tk )  — ■>  oo 

Proof:  Let  Tt  be  the  largest  balanced  tree  that  is  a  subtree  of  T*.  Then  DepthiTf) 
is  non-decreasing.  If  DepthfTf)  does  not  go  to  infinity,  then  we  have  some  K  such 
that  Depthif  i )  — >  K .  This  implies  that  there  is  at  least  one  leaf  node  t  and  integer 
n  >  0  such  that  t  is  a  leaf  node  for  all  T.t)i>  n.  Then  for  all  such  TtJ  any  subtree 
T  F  Ti  contains  a  leaf  node  that  is  either  t  or  one  of  its  ancestors.  Lemma  2 
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gives  us  that  none  of  t  or  its  ancestors  have  probability  zero,  and  that  for  t  and  its 
ancestors,  A D(.)  >  0.  Denoting  by  Anc(t )  the  set  composed  of  t  and  its  ancestors, 
let  5  =  minfeAnc(t)  A D(t)  >  0. 

Since  t  or  one  of  its  ancestors  are  present  in  the  set  of  leaf  nodes  of  all  T*,  we 
have 


0  <  5  <  min  AD(Ut,  k(t )  :  if;)  < 

l<j<n  '  ^ 

t&Tj- 1 


MM 

w(k) 


Where  the  second  inequality  comes  from  Lemma  5  with  w(k)  =  j_1. 

Since  Moc/w(k )  — >  0  this  fails  to  hold  true  for  sufficiently  large  values  of  k. 
This  shows  that  our  assumption  that  Depth(fi )  converges  to  K  is  wrong  and  thus 
Depth{fi )  — >  cx). 

Proof  of  Theorem  2:  Let  Tf  ■<  T-2.  Then  R{T\ )  <  R(T2).  Also  the  rate  of  a 
balanced  tree  T  is  equal  to  Depthif).  These  two,  along  with  the  previous  lemma 
show  that  R(Ti)  — >  oo,  which  shows  that  the  algorithm  Alg.  1  with  a  stopping 
criterion  on  the  maximum  rate  will  terminate  after  a  finite  number  of  iterations, 
providing  the  result  we  need. 


4.7  Online  algorithm  for  distortion  minimization 

An  online  algorithm  for  distortion  minimization  has  been  implied  in  [2]  where  the 
authors  develop  an  online  algorithm  for  combined  compression  and  classification 
using  VQ.  Here  we  will  present  the  algorithm  and  show  how  it  is  related  to  the 
LBG  algorithm. 

The  algorithm  presented  in  [2]  is  as  follows.  We  assume  a  source  of  i.i.d.  random 
vectors  xi,x2,x3, . . .  that  have  a  distribution  p(x).  We  start  with  an  initial  set  of 
centroids  0  =  {ffi,  d2, . . . ,  Ok}-  Then  for  each  observation  xn  we  find  the  centroid 
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6i  that  is  closest  to  it  in  terms  of  the  distortion  function  p(xn,  Oi).  This  centroid  is 
moved  in  the  direction  of  xn.  Specifically 


Oi(n)  =  9 1 (n  -  1)  +  a(n)  p(xn,  O^n  -  1))  (4.8) 


As  shown  in  [2],  in  the  limit  a(n)  — >  0,  the  trajectory  of  0  follows  the  trajectory 
of  the  system  of  ordinary  differential  equations 


9i—  p(x)  Vep(x,  9i)dx 
Jv  i 


6*2  =  /  p(x)  Ve  p(x,82)dx 

Jv2 


Ok  =  /  p{x)  Ve  P(x,  9k)dx 
Jvk 

where  V,  is  the  cell  in  the  Nearest  Neighbor  partition  that  corresponds  to  centroid 
9i .  Note  that  each  V,  can  be  a  function  of  all  centroids  6i,  92, . . . ,  9^. 

In  [2]  the  authors  claim  that  this  is  a  gradient  descent  on  the  cost 

fc  „ 

J(0)  =  y]/  p(x)p(x,9i)dx  (4.9) 

*=i  J 

Here  we  will  prove  this  claim.  To  simplify  the  proof  we  will  consider  the  case  of 
only  two  centroids.  The  general  /e-centroid  case  can  be  similarly  proved  but  we 
will  only  present  pointers  on  how  to  extend  this  proof  for  that  case. 

The  ODE  that  characterizes  the  behavior  of  the  centroids  in  the  2-centroid 
algorithm  is 


9 1=  p(x)  S7ep(x,  9i)dx 

Jv  i 

02=  p(x)  Ve  p(x,  02)dx 

Jv2 


(4.10) 


50 


where  Vi  =  {x  G  R  :  p(x,9 1)  <  p(x,  92)}  and  V2  =  {x  G  M  :  p(x,92 )  <  p(x,9 1)}. 
The  cost  4.9  is 

=  p(x)p(x,9i)dx  (4.11) 

*=i  J 

Taking  the  gradient  of  the  cost  with  respect  to  6\  or  02  is  complicated  by  the 
fact  that  the  regions  of  integration  Vi  and  V2  are  themselves  functions  of  9\  and 
9 2.  To  simplify  notation,  let  us  denote 

g{x,9)  =  p(x,9)p(x) 

Then  the  expected  distortion  D(9i,92)  can  be  written  as 

D(9i,92)=  I  g(x,91)dx+  f  g(x,92)dx  (4.12) 

JVl(01,02)  4V2(0 1,02) 

where  we  have  made  explicit,  the  dependence  of  Vi  and  V2  on  9\  and  92. 

To  find  the  gradient  of  4.12  with  respect  to  9\ ,  let  us  compute 


Denote  144  =  +  A 9,92)  n  V2(91,92)  and  W2  =  V2(0i  +  A 9,92)  n  V^A,^) 

(see  Fig.  4.1).  Then  we  can  write 
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V2(01 ,02)  VI  (01 ,02) 


V2(01+A0,02)  VI  (01  +A0,02) 


Figure  4.1:  Illustration  of  Vi ,  V2 ,  Wi  and  W2  in  2  dimensions 


D(9i  +  Ad,  02)  -  D{9 u02) 


=  /  [g(61  +  A6,x)-g(61)]dx 

Jv  i(fi>i,e>2) 

+  /  g(0i  +  AO,  x)  dx  —  /  g(A  +  Ad,  x)  dx 
Jwi  Jw2 

+  /  g(02,x)  dx  —  /  g(02,x)dx 

Jw2  J\Vi 

=  Adr  /  dx  +  (higher  order  terms  in  A#) 

Jvi(0i,e2) 

+  /  g(di  +  Ad,  x)  dx  —  /  (?(di  +  Ad,  x)  dx 
JWx  Jw2 

+  /  g(02,x)  dx  —  /  g(02,x)dx 

Jw2  Jwi 

The  sets  Ilfi  and  IT2  are  infinitesimally  thin  sets  along  the  hyper-plane  that  sep¬ 
arates  Vi  and  V2.  Since  the  definition  of  this  hyper-plane  is  such  that  H  —  (x  : 
p(x,$i)  <  p(x,  d2)}  we  have  g($i  +  A 6,x)  —  g(02,x)  =  O(AO).  Thus 


/  g{0\  +  Ad,  x)  dx  —  /  p(0i  +  Ad,  x)  dx  +  /  g{02,x)dx—  /  g{02,x)dx 

'w 1  Jw2  Jwi 

=  0  +  (higher  order  terms  in  Ad) 


52 


Thus 


D(9l+Ad,92)-D(d1,d2) 


a eT  [ 

JVl(01,02) 


X/giO i,x)  dx+  (higher  order  terms  in  Ad) 


which  implies  that 


\/eiD{9 1,02) 


Ve1fl'(6li,a;)  dx 


S7oiP{0i,x)p(x)  dx 


Similarly 

Ve2D(9i,d2)=  S7e2p(92,x)p(x)  dx 

■JV2 

which  is  the  result  we  wanted. 

Note  that  the  method  of  proof  depended  on  the  fact  that  the  infinitesimal 
change  in  total  distortion  is  zero  when  a  perturbation  in  9\  causes  a  point  close  to 
the  separating  hyper-plane  to  move  from  V\  to  V2.  This  method  cannot  be  used  to 
establish  the  cost  function  that  is  minimized  by  the  Learning  Vector  Quantization 
algorithm. 

The  extension  of  this  proof  to  more  than  two  centroids  is  messy  but  straight¬ 
forward.  Instead  of  looking  at  the  hyper-plane  between  only  two  centroids  we 
have  to  look  at  the  hyper-planes  between  a  given  centroid  and  all  other  centroids. 
An  argument  similar  to  the  one  above  shows  that  perturbation  in  the  position 
of  one  centroid  does  not  change  the  distortion  contribution  from  points  near  the 
separating  hyper-planes. 


4.8  Practical  implementation  of  the  MRTSVQ 

Practical  implementation  of  the  tree  structured  quantizer  brings  up  several  issues 
that  have  not  been  addressed  above.  Two  of  them  are  initialization  of  centroids 


53 


and  choosing  0,  the  function  that  tells  whether  to  split  in  the  current  resolution 
or  the  next.  Initialization  of  centroids  is  accomplished  by  randomly  selecting  data 
vectors  that  falls  into  the  cell.  This  usually  gives  good  results.  If  we  have  more 
than  two  centroids  at  each  split,  we  can  initially  split  with  just  two  centroids  and 
then  introduce  additional  centroids  one  after  another  at  random  positions. 

One  way  of  deciding  whether  to  go  up  a  resolution  is  by  comparing  the  change 
in  distortion  on  splitting  at  the  current  resolution  to  the  total  distortion.  If  the 
change  is  less  than  a  certain  fixed  fraction  (call  it  So),  then  we  split  at  the  next 
higher  resolution.  Increasing  So  will  force  the  algorithm  to  go  to  higher  resolutions 
early  on,  but  will  result  in  lower  distortion  for  a  given  rate.  If  hi)  is  chosen  small 
the  algorithm  exploits  more  of  the  information  in  the  lower  resolution  before  trying 
out  the  higher  resolutions.  The  tradeoff  is  that  one  ends  up  with  a  higher  rate  for 
the  same  total  distortion  level. 

Another  way  of  choosing  the  resolution  to  split  at  is  to  look  at  a  weighted 
sum  of  the  computational  complexity  and  the  decrease  in  distortion  for  a  split  at  a 
given  resolution.  The  resolution  that  minimizes  this  function  will  be  chosen  for  the 
splitting.  Different  tradeoffs  between  computational  complexity  and  tree  efficiency 
can  be  obtained  with  different  weights. 
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Chapter  5 


Multi-resolution  classifiers 


In  this  chapter  we  will  explore  the  use  of  Multi-Resolution  TSVQ  as  a  classifier. 
We  assume  that  a  source  can  be  in  any  one  of  two  states  or  classes  when  it  outputs 
a  signal  x(t),t  G  [0, 1).  Labeling  the  classes  by  1  or  2,  we  have  a  probability  density 
defined  on  C2[r)  that  depends  on  whether  the  source  is  in  class  1  or  2.  Let  pi(x) 
denote  the  probability  density  at  x  given  that  the  source  was  in  class  1  and  p2(x) 
denote  the  density  given  that  the  source  was  in  class  2.  We  assume  that  either 
class  occurs  independently  with  probabilities  7Ti  and  7t2  =  1—  t\\. 

Our  problem  is  to  estimate  the  class  of  the  source  given  only  the  output  x. 
As  mentioned  in  Chapter  3,  if  we  know  the  prior  probabilities  and  the  marginal 
probability  densities,  the  Bayes  optimal  classifier  achieves  the  least  possible  mis- 
classification  error.  If  all  we  have  is  a  training  set  consisting  of  pairs  {xn,  cn}  of 
observations  and  classes,  we  need  algorithms  that  find  a  classifier  that  comes  as 
close  to  the  Bayes  optimal  classifier  as  possible. 
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5.1  Multi- resolution  vector  classification 

Now  consider  the  above  problem  of  constructing  a  classifier  from  training  samples 
with  the  added  feature  that  the  signals  output  by  the  source  can  be  observed  in 
various  levels  of  detail.  The  first  question  is:  How  is  this  useful?  The  answer  is 
that  processing  each  signal  to  determine  the  class  might  be  easier  if  less  detailed 
versions  of  the  signal  are  used.  Most  classifiers  compare  the  signal  to  a  set  of 
exemplars  and  this  comparison  is  usually  computationally  simpler  if  the  signal  is 
of  lower  dimension. 

Additionally,  in  some  cases,  most  of  the  high  level  detail  of  the  signal  is 
swamped  in  noise  and  only  the  low  resolution  features  offer  usable  information 
about  the  class  of  the  signal.  In  such  a  case  using  the  high  resolution  details  might 
make  the  classifier  perform  worse  compared  to  using  low  resolution  representation 
since  the  algorithm  tries  to  model  the  noise. 

Multi-resolution  analysis  is  used  to  find  out  good  features  for  other  kinds  of 
classifiers  too.  Typically,  the  multiple  resolutions  of  the  signal  are  examined  to 
find  those  resolutions  that  have  the  most  potential  for  the  classification  problem. 
In  a  TSVQ,  this  selection  is  done  automatically.  At  each  step  of  the  tree  growing 
process,  we  not  only  decide  which  node  to  split,  but  also  which  resolution  to  split 
it  in.  A  resolution  that  offers  the  best  trade-off  between  dimensionality  and  error 
reduction  is  chosen. 

The  second  question  is:  How  can  we  construct  hierarchical  classifiers  using 
training  sets  in  multiple  resolutions?  Here  we  will  provide  one  solution  based  on 
an  extension  of  the  Learning  Vector  Quantization  we  presented  in  Chapter  2. 
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5.2  Multi-level  LVQ 

Taking  the  algorithm  used  to  create  the  MRTSVQ  for  compression,  we  can  try  to 
extend  LVQ  to  the  multi-resolution  tree  structured  classifier  in  a  straightforward 
way.  We  start  out  with  a  root  node  that  contains  as  its  cell,  all  the  training  vectors 
at  the  lowest  resolution.  The  class  of  the  root  node  is  assigned  according  to  the 
majority  vote  amongst  the  training  vectors;  if  the  class  1  vectors  outnumber  the 
class  2  vectors,  then  the  class  of  the  node  is  1  and  vice-versa.  Then,  recursively, 
each  leaf  node  of  the  tree  is  examined  to  find  the  one  that  gives  the  biggest  decrease 
in  the  classification  error  when  its  cell  is  split  into  two  parts,  each  part  assigned 
a  different  class.  This  splitting  is  done  by  the  LVQ  algorithm  at  an  appropriate 
resolution.  The  node  that  gives  the  highest  difference  is  then  split  to  produce  a 
new  tree  and  this  process  repeats. 

The  only  problem  with  this  algorithm  is  that  LVQ  takes  a  long  time  to  converge 
(learning  rate  a(n )  =  0(l/n)).  Waiting  until  the  centroids  at  a  higher  resolution 
have  converged  before  going  on  to  the  next  resolution  is  very  time  consuming. 
Additionally,  in  an  online  learning  case,  we  cannot  use  the  classifier  until  all  the 
levels  of  the  tree  have  converged.  In  such  a  case,  we  would  be  interested  in  an 
algorithm  where  all  levels  of  the  tree  are  simultaneously  updated  when  each  data 
sample  arrives. 

Such  an  algorithm  would  start  out  with  a  structure  for  a  tree  with  centroids  for 
each  node  initialized  to  appropriate  values.  Then,  as  each  training  sample  arrives, 
it  will  be  used  to  update  the  centroids  of  all  nodes  of  the  tree  at  the  appropriate 
resolution  simultaneously.  Thus  the  tree,  as  a  whole,  adapts  to  the  data  rather 
than  each  node  adapting  separately.  The  classifier  tree  can  be  used  at  any  time  for 
predicting  the  class,  with  a  misclassihcation  error  that  goes  down  as  the  centroids 
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converge. 

Let  us  now  present  the  formal  algorithm 
Algorithm  II: 

1.  Fix  in  advance  the  MRA,  size  of  tree  (maximum  depth),  data  size  and  number 
of  iterations. 

2.  Start  with  an  initial  tree  T0  consisting  of  a  structure  that  details: 

(a)  How  the  nodes  are  connected  to  each  other;  i.e.  which  nodes  are  the 
children  of  a  given  node. 

(b)  The  resolution  of  each  node.  Each  node  is  at  the  same  resolution  as  its 
siblings  and  we  require  that  the  resolution  is  non-decreasing  down  the 
tree  along  every  branch. 

(c)  The  initial  positions  of  the  centroids  at  each  node.  The  centroids  at 
each  node  must  be  at  the  resolution  of  the  node. 

3.  Initialize  iteration  number  i  —  1  and  data  number  n  =  1. 

4.  Take  data  vector  xn  at  the  lowest  resolution  and  compare  it  to  the  centroids 
of  the  children  of  the  root  node  and  perform  the  following  update 

0o (*+1)  =  0l(t)  -  ao{t){y0xn  -  el(t)) 

if  \\y0xn  -  0oCOH2  <  listen  -  Ol(t)\\2  and  cn  =  1 
0o(*+l)  =  0o(*)  +a0(t)(y0xn  -0o(t)) 

if  \\y0xn  -  6»o(t) ||2  <  \\y0xn  -  02(t)\\2  and  cn  =  2 
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6»o(t+l)  =  Ol(t)  -  a0(t)(y0xnr~  0jj(t)) 

if  \\y0Xn  -  9jj(t) II2  <  \\y0xn  -  dl(t)\\2  and  cn  =  2 
9%(t  +  1)  =  6l(t)  +a0(t)(y0xn  ~Ol(t)) 

if  \\y0xn  -  6»o(t)  II2  <  \\y0xn  -  6»o(t)  ||2  and  cn  =  1 

5.  Recursively,  for  each  pair  of  nodes  9} ,  9‘f  at  resolution  k,  do  the  following 
update  only  if  their  parent  was  updated. 

9]  (t  +  1 )  =  9]  ( t )  -ak  ( t )  (y'kXn  -  9\  ( t ) ) 

if  \\ykxn  -  9}(t) II2  <  \\ykxn  -  92(t) II2  and  cn  =  1 
(t  +  1 )  =  d,1  (t)  +ak  ( t )  (ykxn  ~  9}  (t) ) 

if  || ykxn  -  9} (f)||2  <  \\ykxn  -  9i(t) ||2  and  cn  =  2 

9i(t  +  1)  =  6»2  (t)  -  (t)  (yt3:„  -  9]  (t) ) 

if  \\S?kxn  -  92(t) ||2  <  \\ykxn  -  9](t)\\2  and  cn  =  2 
02  (t  +  1 )  =  d2  (t)  +  ak  ( t )  (^fcXn  -  0t*  (t) ) 

if  ||^ -  6»2(f)||2  <  \\ykxn  -  92{t) ||2  and  cn  =  1 

6.  Increase  data  number  by  1.  Repeat  for  all  data  vectors. 

7.  Increase  iteration  number  by  1.  Repeat  for  all  iterations. 

Note  that  there  is  an  implicit  linkage  between  a  node  and  its  parent.  Only  the 
vectors  that  cause  the  parent  to  be  updated  will  be  considered  for  updating  the 
node  and  its  siblings.  Thus,  the  only  vectors  being  used  to  split  a  node  will  be  the 
vectors  that  fall  into  its  cell. 
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5.3  Convergence  of  multi-level  LVQ 


In  our  algorithm,  the  decision  whether  to  consider  a  vector  for  the  update  of  a 
node  depends  on  whether  the  vector  fell  in  the  cell  of  the  node’s  parent.  Since  the 
cell  of  the  parent  changes  with  the  position  of  the  centroid  at  each  time  step,  the 
centroids  of  the  node  have  to  try  and  keep  up  with  a  “moving  target.”  We  can 
ask  the  question:  Under  what  conditions  will  the  centroids  of  the  tree  converge  to 
a  stationary  value  as  the  number  of  data  vectors  increase  indefinitely? 

To  investigate  this,  we  will  need  to  present  and  prove  a  theorem  that  is  a  slightly 
different  from  the  result  shown  by  Borkar  [5].  While  [5]  deals  with  stochastic 
approximation  algorithms  in  two  levels  with  different  “training  speeds” ,  we  need  to 
generalize  it  to  arbitrary  number  of  levels  K .  In  addition,  the  algorithm  considered 
in  [5]  has  an  update  function  that  depends  on  the  state  variables  at  lower  levels 
(i.e.  update  for  state  Xi  depends  on  xj,  j  >  i).  We  will  only  need  update  functions 
that  depend  only  on  state  variables  at  higher  levels.  This  adds  some  simplicity  to 
the  proof. 

5.3.1  Preliminaries 

In  what  follows,  we  will  denote  by  ay(n)  G  the  drdimensional  component  of  the 
state  space  in  the  i-th  level  at  time  step  n.  W(n )  =  [xo(n),  xi(n), . . .  Xi(n)]  denotes 
the  set  of  state  variables  that  are  at  or  above  the  i-th  level.  X(n)  =  Xk-i  (n)  = 
[x0(n),  Xi(n), . . .  XK-i(n)]  denotes  the  total  state  space  of  the  algorithm. 

Consider  the  stochastic  approximation  algorithm  consisting  of  the  linked  dif- 
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ference  equations 


x0(n  +  1)  =  x0(n)  +  a0(n)[fo(X0(n))  +  M0(n )] 
xi  (n  +  1)  =  xi(n)  +  ai(n)[fi(Xi(n))  +  Mi(n)} 

xA'_i(n  +  l)  =  xK-i(n)  +  aK-i(n)[fK-i(XK-i(n))  +  Mx_i(n)]  (5.1) 

where 

•  A.l  fi(Xi(n ))  :  Rdo+di+d2+--<ii  jg  LipSChitz  for  all  i 

•  A. 2  J2nai(ln)  =  oo  and  Y2nai(n)2  <  oo  for  all  i 

•  A. 3  ai(n)/aj(n )  — >  0  if  z  <  j  for  all  i,j 

•  A. 4  If  Fn  =  cr{Xj(/),  Mj(/)|/  <  n,  Vi}  denotes  the  cr-algebra  formed  by  the 
state  Xi  and  the  noise  Mj ,  then  ( Mi(ri),Fn )  are  random  variables  satisfying 

cq  ( n)Mj(n )  <  oo  a.s. 

n 

In  the  case  of  the  multi-level  LVQ,  we  will  show  that  G*(n)  =  ^",=1  ai(m)Mi(m) 
is  a  martingale;  then  condition  A. 4  follows  from  a  version  of  the  martingale  con¬ 
vergence  theorem. 

The  usual  method  of  analyzing  the  convergence  of  a  system  of  stochastic  ap¬ 
proximations  like  5.1  is  to  compare  it  to  the  associated  ordinary  differential  equa¬ 
tion  (ODE).  Denoting  by  the  same  symbols  {xi},  {Aj},  the  state  variables  in  con- 
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tinuous  time,  we  have  the  system  of  equations 


xo(t)  =  fo(X0(t)) 
eii  (t)  =  /i(Xi(t)) 

e2x2{t)  =  f2(X2(t)) 


CK  lXK-lif)  —  fK-l(XK-l(t))  (5.2) 

for  the  limit  e  |  0.  This  is  a  generalized  singularly  perturbed  system  where 
each  variable  Xi  behaves  as  if  all  variables  x0  to  Xj_i  are  constants.  Denote  by 
Aj(xo,  Xi, ,  Xi- 1)  the  function  that  gives  the  equilibrium  point  of  Xi(t)  =  fi(Xi(t )) 
for  given  constant  values  of  Xo,  x±, . . . ,  Xj_i. 

Now  suppose  that  there  is  an  asymptotically  stable  equilibrium  A"*  =  [a^,  x*,  . . . ,  x*K_f\ 
for  the  system  of  equations  5.2.  Then  we  have  the  following  theorem 

Theorem  3  If  there  is  an  N  such  that  X  (■ n )  remains  in  the  domain  of  attraction 
of  X*  for  all  n  >  N  and  supn  X(n)  <  oo,  the  iterates  5.1  converge  to  X*  a.s. 

The  assumption  that  X  (n)  remains  in  the  domain  of  attraction  of  the  equi¬ 
librium  and  that  its  supremum  is  bounded  might  not  be  always  valid.  In  that 
case,  we  can  keep  projecting  X (n)  back  into  the  bounding  set  at  the  expense  of  an 
error  term.  [25]  shows  how  to  deal  with  the  error  term  in  the  case  of  constrained 
optimization. 

5.3.2  Proof  of  the  theorem 

Our  proof  closely  parallels  and  borrows  much  of  its  notation  from  [5].  We  will 
first  show  that  the  state  variable  Xi(t)  at  any  level  i  >  0  converges  to  x*  = 
Aj(xo,xi, . . . ,  Xi-i).  Then  we  will  show  that  xq (t)  converges  to  Xq. 
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First,  a  definition  and  a  lemma.  Consider  the  ODE  in  given  by 

z(t)  =  h(z(t))  (5.3) 

for  a  Lipschitz  h  such  that  Eq.  5.3  has  an  asymptotically  stable  attractor  J  with 
a  domain  of  attraction  D(J).  Given  T,  5  >  0,  we  call  a  bounded  measurable 
y(t)  :  R+  — ■>  Rd  a  (T,  5)-perturbation  of  Eq.  5.3  if  there  exists  0  =  T0  <  Tf  <  . . .  < 
Tn  =  oo  with  Tl+\  —Ti>T  and  solutions  z3{t)  of  Eq.  5.3  such  that 

sup  \\zJ(t)  -  y(t)\\  <  5 

t€.[Tj  ,Tj+l] 

Lemma  10  Given  e,  T  >  0  there  exists  a  5  >  0  such  that  for  S  G  (0,5),  every 
(T,  S) -perturbation  of  Eq.  5.3  converges  to  the  e-neighborhood  Je  of  J. 

The  proof  is  given  in  the  appendix  of  [5] 

Fix  a  level  i.  We  will  now  show  that  as  far  as  this  level  is  concerned,  the  ODE 
being  followed  by  5.1  is 


xo(t)  =  0 

xi(t)  =  0 

iy_i(f)  =  0 

ii(t)  =  fi{Xi(t )) 
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For  a  given  T  let  £;(0)  =  0  =  T)( 0)  and  define 


n 


ti(n) 

=  Yai(k),n  >  1 

k= 1 

m(  0) 

=  o, 

( 

m(n) 

=  min  <  k  >  m(n 

Ti(n) 

\ 

=  tifmfn)) 

1)1 


Y  ai(j)  ^  T 


j=m(n— 1)+1 


(5.4) 


With  this  notation  we  have  T)(j  +  1)  e  [Ti(j)  +  T,  Tfj )  +T  +  Ci\,Ci  =  supn  cq(n). 
Define  W(f)  =  {x0(t) ,  Xi(t) , . . .  ,Xi(t)}  as  X^t^n))  =  X^n)  with  a  linear  interpo¬ 
lation  for  the  time  between  the  instants  t*(n)  and  f*(n  +  1). 

Consider  the  system  of  equations  5.4.  We  have  the  lemma 

Lemma  11  For  any  5  >  0,  there  exists  a  time  t§  >  0  such  that  Xi(t  +  t§)  is  a 
(T,  5)  perturbation  of  5.4 

Proof:  Rewrite  the  stochastic  approximation  algorithm  5.1  up  to  level  i  as 

x0 {n  +  1)  =  x0(n)  +  ati(ri)  a°^  [f0(X0(n))  +  M0(n)\ 

xi  (n  +  1)  =  xi(n)  +  «i(w)Qil^[/i(A'i(w))  +  M^n)]  (5.5) 


Xi(n  +  1)  =  Xi(n)  +  ai(n)[fi(Xi(n))  +  M^n)] 

Let  Xf(t)  be  the  solution  to  5.4  on  [Tj(ri),  oo)  with  initial  conditions  Xf(t)  = 
Xi(n).  Then  5.5  can  be  seen  as  a  discretized  version  of  5.4  with  step  sizes  cq(n) 
and  an  error 

Cxiyii  j 
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at  the  j-th  level,  j  <  i  and  a,(n)  M,(n)  at  the  i-th  level.  The  fact  that  all  fj 
are  Lipschitz  and  that  otj(n) / ai(n )  — >  0  for  all  j  <  i  makes  the  contribution  of 
this  error  asymptotically  negligible  as  n  — >  oo.  With  the  fact  that  cq(n)  — >  0,  we 
get  the  required  result  by  a  standard  approximation  argument  using  the  Gronwall 
inequality. 

Lemma  12  Ij(n)  — >  [rr0,  xi, . . . ,  xt- 1,  A*  (To,  xi, . . . ,  Xj_i)] 

This  follows  from  the  above  two  lemmas. 

Now  we  have  to  show  that  Xq (n)  converges.  This  follows  in  a  straightforward 
fashion  from 

Lemma  13  Under  the  conditions  in  Theorem  3  and  A.l  to  A. 4,  xq (n)  — >  Xq  where 
x^  is  the  equilibrium  solution  of 

Xo  (t)  =  fo(xo(t)) 


a.s. 

Proof:  See  [25] 

This  completes  the  proof  of  Theorem  3 

5.3.3  Multi-level  LVQ 

The  application  of  Theorem  3  to  the  Multi-level  LVQ  algorithm  is  straightforward 
once  we  set  the  algorithm  up  in  the  form  of  5.1  and  verify  that  assumptions  A.l 
to  A. 4  are  satisfied. 

To  make  the  link  between  Algorithm  1  in  section  5.2  and  the  stochastic  ap¬ 
proximation  algorithm  5.1,  let  us  denote  the  set  of  all  centroids  at  resolution  k  by 
©fc(n)  =  {0l(n),6i(n)}  node  i  belongs  to  the  k- th  level  of  the  tree.  Also  denote 
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by  y'kX  =  [y'kX,  y'kX, . . . ,  ^x]  where  the  number  of  repetitions  is  equal  to  the 
number  of  centroids  at  resolution  k.  Then  the  LVQ  algorithm  can  be  written  as 

0o(n  +  1)  =  0O W  +  a0(n)77o(©o(n))[@0(n)  -  ^0x(n)] 

©i (n  +  1)  =  ©i(n)  +  ai(n)?/i(0o(n),  ©i(n))[©i(n)  -  <54 x(n)\ 

0A'-i(n  +  l)  =  ©A'-i(n) 

+aK-i(n)r]K-i(<do(n),  ©i  (n), ©x-i(n))[©x-i(n)  -  yK-ix\ 

where  ^(©o,  ©i, . . . ,  ©«)  is  a  diagonal  matrix  of  size  4*4  x  4  <4  where  4  is  the 
number  of  centroids  and  c4  is  the  dimension  of  the  signal  vector  at  the  resolution 
k.  The  diagonal  is  composed  of  blocks  of  length  c4  where  all  elements  of  the  j-th 
such  block  are 

0  if  vector  x  does  not  fall  into  each  ancestor  of  centroid  j 

1  if  re  falls  into  each  ancestor  of  j  and  the  class  of  x  is  different  from  class  of  j 
y— 1  if  x  falls  into  each  ancestor  of  j  and  the  class  of  x  is  same  as  class  of  j 

Here  centroid  j  refers  to  the  j-th  amongst  the  4  centroids  at  resolution  k. 

Now  let  us  show  that  assumption  A. 4  is  satisfied 

4(00  W,  ©i(n),  •  •  • ,  ©i(n))  =  Ex{rji(e0(n),  ©i(n), . . . ,  ©i(n))[©i(n)  -  J^x]} 
and 

Mi(n)  =  r/i(. .  .)[©i  -  &ix\  -  fi(. . .) 

Then  Ex(M(n ))  =  0.  Given  that  xn  and  xrn  are  uncorrelated  for  n  4  m,  it  is  easy 
to  verify  that 

E{Mi(n)Mi(m)}  =  0 


66 


These  two  facts  imply  that  is  a  martingale  difference  sequence  and  that 


Gi{n)  =  ^2  on(m)Mi(m) 

m= 1 

is  a  martingale.  Then,  a  martingale  convergence  theorem  like  Theorem  5.14  in  [7] 
shows  that  Gi(n)  converges  a.s. 

A. 2  and  A. 3  are  design  issues,  so  the  only  other  assumption  we  have  to  check 
is  A.I.;  i.e.  that  /;(. . .)  is  Lipschitz.  ft  is  not  hard  to  verify  that  this  is  satisfied  if 
x  has  a  probability  distribution  p(x )  =  ttiPi(x)  +  vr2p2(^)  that  is  bounded. 

Thus,  all  assumptions  for  Theorem  3  are  satisfied  and  we  get  the  result  that 
all  centroids  Oj{n )  converge  as  n  — >  oo. 

5.4  Issues  in  practical  implementation 

Practical  implementation  of  the  above  algorithm  introduces  several  issues  that  are 
not  readily  apparent  from  the  description,  ffere  we  will  discuss  some  of  them  and 
offer  heuristic  solutions. 

The  convergence  result  holds  only  if  there  is  an  equilibrium  solution  to  the 
associated  ODE  and  if  the  initial  conditions  are  close  enough  to  it.  The  ODE 
might  not  have  an  equilibrium  solution  if,  for  a  cell,  there  is  no  hyper-plane  that 
will  partition  it  into  two  parts  with  clear  majorities  in  each  part.  This  can  mean 
two  things;  either  there  is  no  more  improvement  possible  by  splitting  that  node 
at  the  current  resolution  or  improvement  is  possible  only  with  more  than  two 
partitions.  If  it  is  the  former  case  one  might  try  splitting  at  a  higher  resolution.  In 
the  latter  case  the  algorithm  can  be  modified  to  include  the  possibility  of  splitting 
a  node  with  more  than  two  partitions  and  its  convergence  proved  without  difficulty. 
Proper  initialization  of  the  centroids  is  very  important  for  fast  convergence. 
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Several  heuristic  rules  exist  for  good  initialization.  One  method  that  works  well  is 
randomly  selecting  a  data  vector  of  each  class  as  initial  values.  Another  method  is 
to  initialize  the  centroid  to  the  mean  of  the  data  vectors  that  belong  to  its  class. 

It  is  necessary  to  initialize  the  class  of  each  centroid  according  to  the  majority 
vote  in  its  partition  and  update  the  class  after  a  number  of  iterations.  [26]  shows 
an  example  where  failure  to  do  this  will  result  in  divergence  of  the  centroids  even 
though  there  is  a  stable  solution.  It  might  happen  that  this  updating  causes  the 
class  of  both  centroids  to  be  the  same;  in  such  a  case  re-assignment  of  the  centroids 
and  their  classes  is  the  only  recourse. 


Chapter  6 


Simulations  and  applications  of 

MRTSVQ 


In  this  chapter  we  will  present  two  simulation  examples  for  compression  and  clas¬ 
sification.  Working  on  synthetic  data,  these  simulations  will  illustrate  the  im¬ 
plementation  of  the  algorithms  we  presented  in  earlier  chapters  and  their  typical 
behavior. 

We  will  also  present  several  applications  of  clustering  and  classification  using 
MRTSVQ.  Tree  structured  VQ  will  be  used  to  cluster  yeast  genes  according  to  their 
expression  profile  and  for  classification  of  cells  into  tumerous  and  non-tumerous 
classes.  Then  we  will  present  a  parallel  tree  method  for  predicting  wear  on  a  milling 
tool.  Computational  advantages  of  the  MRTSVQ  method  will  be  illustrated  by  an 
application  in  fingerprint  identification. 
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6.1  Simulation  for  compression 


In  this  section  we  will  show  the  results  obtained  by  using  the  greedy  algorithm 
for  quantization  of  signals.  We  have  used  the  algorithm  described  in  Chapter 
4.  Splitting  was  done  using  the  LBG  algorithm  with  random  initial  values  for  the 
centroids.  The  decision  to  go  up  in  resolution  was  taken  on  the  basis  of  the  decrease 
in  distortion  on  splitting  a  leaf.  If  the  decrease  was  less  than  a  fixed  fraction  of 
the  total  distortion  of  the  leaf,  we  proceeded  to  the  higher  resolution. 

The  quantized  signals  were  created  as  the  output  of  a  linear  filter  of  2nd  order 
with  an  i.i.d,  Gaussian  noise  input.  These  signals  were  then  analyzed  in  7  resolu¬ 
tions  using  a  Bi-orthogonal  wavelet  of  order  2  for  both  analysis  and  reconstruction. 
The  resulting  coefficient  vectors  were  transformed  to  have  the  same  norm  as  the 
function  they  approximate.  Fig.  6.1  shows  an  example  of  a  signal  at  7  resolutions. 


Figure  6.1:  Signal  in  7  resolutions 


Fig.  6.2(a)  shows  the  decrease  in  distortion  as  the  number  of  splits  increase 
while  Fig.  6.2(b)  shows  the  increase  in  rate  as  the  tree  grows.  Note  that  distortion 
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decreases  to  zero  in  the  asymptotic  case  while  rate  goes  to  infinity  as  was  shown 
in  Chapter  4. 


(a)  Distortion  vs.  split  number 


(b)  Rate  vs.  split  number 


6.2  Simulation  for  classification 

We  used  the  greedy  tree  growing  algorithm  to  create  a  tree  structured  classifier  for 
multi-resolution  data.  Each  signal  can  be  expressed  as 

K 

s(t)  = 

k= 1 

where  each  Sk{t)  is  re-sampled  from  a  c4  dimension  random  vector  Xk-  The  Mat- 
lab  command  “resample”  was  used  for  this  purpose.  The  random  vector  Xk  was 
produced  from  a  Gaussian  mixture  density  for  either  class.  The  distance  between 
the  mean  of  :r>.  for  class  1  and  2  increases  as  k  increases.  This  results  in  a  signal 
s(t)  that  has  increasing  information  relevant  to  the  classification  as  the  resolution 
increases.  The  Daubechies  8th  order  wavelet  was  used  to  analyze  the  data  into 
5  resolutions.  The  tree  growing  algorithm  used  LVQ  to  split  each  node.  At  each 
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iteration,  the  node  that  offered  the  biggest  reduction  in  the  classification  error  was 
chosen  to  be  split.  Fig.  6.2  shows  the  decrease  in  the  Bayes  risk  as  the  number  of 
splits  increase. 


Figure  6.2:  Classification  error  vs.  split  number 


6.3  Yeast  gene  clustering 

An  application  of  un-supervised  clustering  for  clustering  yeast  ( Saccharomyces 
Cervisiae )  genes  according  to  their  expression  profiles  is  presented  in  this  section. 
This  application  is  motivated  by  the  recent  advances  in  gene  analysis  techniques 
that  have  the  potential  to  measure  the  expressions  of  thousands  of  genes  at  the 
same  time  [46]  [40].  Since  each  cellular  process  is  the  result  of  several  hundred 
genes  acting  in  concert,  such  techniques  have  expanded  our  ability  to  study  the 
cell  at  an  unprecedented  level  of  detail. 

The  inevitable  consequence  of  this  detail  is  the  flood  of  data  that  is  produced 
by  techniques  such  as  micro-arrays.  There  is  an  urgent  need  for  advanced  methods 
for  organizing  and  displaying  this  data  in  a  manner  that  is  useful  and  intuitive  for 
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biologists.  One  possible  way  of  organizing  is  to  group  together  the  genes  that  are 
expressed  similarly  in  similar  environments. 

Since  most  cellular  processes  are  step-by-step  processes  [28]  where  the  products 
of  one  step  initiate  and  control  the  activity  in  the  next  step,  one  would  expect  that 
genes  that  regulate  each  step  would  be  expressed  together.  Thus  genes  that  belong 
to  a  common  metabolic  pathway  would  be  highly  correlated  in  their  expression. 
This  in  turn,  means  that  clustering  genes  according  to  their  expressions  in  different 
environments  would  group  together  genes  that  have  similar  function. 

The  data,  obtained  from  [14],  consists  of  the  log  ratio  of  the  expression  levels 
of  2467  genes  during  8  experiments.  For  each  experiment,  expression  levels  are 
measured  at  different  time  points  (differing  number  of  time  points  for  each  exper¬ 
iment)  giving  a  79-dimensional  vector  of  expression  data  for  each  experiment  and 
time  point. 

Eisen  et.al.  have  used  a  pair-wise,  bottom  up  approach  to  cluster  yeast  genes 
in  [15].  Initially,  all  genes  belong  to  individual  cells.  Then  the  two  closest  genes  (in 
terms  of  a  distance  function)  are  chosen  and  their  cells  are  merged  and  replaced 
by  the  average  of  the  two  genes.  Then  this  process  is  applied  repeatedly  until  all 
cells  have  been  merged  into  the  root  node.  The  result  is  then  displayed  as  a  tree 
and  genes  relating  to  similar  functions  are  shown  to  cluster  together. 

We  will  use  a  similar  clustering  method  to  reveal  similarities  of  expression  in 
genes  of  similar  functional  origin.  However  our  method  will  be  top  down,  in  that 
we  start  out  with  a  single  cluster  containing  all  the  vectors  which  is  then  split 
successively  to  create  the  tree.  In  addition,  a  multi-resolution  analysis  is  used  to 
represent  the  expression  vector  for  each  gene  in  different  levels  of  detail.  The  higher 
splits  in  the  tree  are  done  on  the  basis  of  lower  resolution  vectors  and  additional 


73 


Figure  6.3:  Clustering  tree  for  yeast  gene  expression  profiles 

resolution  is  used  when  needed  in  the  subsequent  splits. 

Representing  the  expression  profile  of  each  gene  in  different  resolutions  has  the 
advantage  that  we  can  easily  see  those  experiments  whose  expressions  are  crucial 
for  clustering  genes  similar  in  certain  ways.  A  low  resolution  representation  of  the 
expression  profile  gives  a  broad  view  of  the  expression  of  the  gene  while  details 
might  be  needed  for  better  discrimination  between  genes  with  similar  functions. 

6.3.1  Data  analysis  methods  and  results 

The  79  dimensional  expression  vector  for  each  gene  was  represented  in  6  resolu¬ 
tions  using  a  Daubechies  4th  order  wavelet  filter.  The  tree  growing  algorithm  is 
implemented  as  discussed  in  Chapter  4.  A  large  enough  rate  constraint  was  put  so 
that  each  leaf  node  ends  up  with  a  relatively  small  number  (less  than  10)  number 
of  vectors.  Fig.  6.3  shows  that  structure  of  the  resulting  tree. 

To  study  the  clustering  of  genes  according  to  function,  we  plot  only  those  nodes 
in  the  tree  that  contain  at-least  one  gene  of  the  particular  function.  For  example, 
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Figure  6.4:  Occurences  of  ribosomal  genes  in  clustering  tree 

Fig.  6.4  shows  all  genes  that  contribute  to  ribosomal  function.  From  these  plots 
we  can  notice  some  significant  features. 

Fig.  6.4  shows  that  ribosomal  genes  are  very  strongly  correlated  in  their  ex¬ 
pression.  Furthermore,  it  is  apparent  that  there  are  five  distinct  clusters  of  such 
genes.  Genes  in  the  same  cluster  behave  similarly  to  each  other,  but  each  cluster 
is  distinctly  different  from  another  cluster. 

Another  kind  of  clustering  behavior  is  noticable  when  we  look  at  Fig.  6.5  which 
shows  genes  that  have  functions  related  to  protein  synthesis.  Here  we  see  that  there 
are  several  strong  and  weak  clusters  along  with  many  genes  that  are  not  part  of 
any  cluster.  This  is  the  kind  of  clustering  that  is  mostly  seen  when  looking  at 
functions  that  occur  under  many  different  environmental  conditions. 

Of  course,  there  are  some  classes  of  genes  that  do  not  cluster  at  all.  Fig.  6.6 
shows  genes  that  have  a  Helix- Turn-Helix  structure.  Such  structural  features  are 
not  expected  to  show  themselves  in  the  expression  profiles  and  are  therefore  hard 
to  cluster. 

In  this  section  we  have  shown  how  un-supervised  clustering  can  be  used  to 
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analyze  and  present  data  from  gene  expression  experiments  to  display  how  genes 
are  organized  according  to  function.  Such  methods  will  be  of  great  importance 
in  the  future  to  deal  with  the  flood  of  information  expected  to  be  available  from 
current  biological  experimental  tools. 

6.4  Lymphoma  prediction  using  tree  classifiers 

In  this  section  we  present  an  application  of  tree  structured  classifiers  for  lym¬ 
phoma  prediction.  Alizadeh  et  al  [17]  have  explored  the  use  of  gene  expression 
analysis  for  prediction  of  cancerous  cells.  We  will  use  the  same  dataset  and  use 
tree  structured  classifiers  with  multi-resolution  analysis  for  classifying  cancerous 
from  non-cancerous  cells. 

We  have  the  expressions  of  4096  genes  from  98  different  cell  types.  Of  these 
98,  72  are  cancerous  while  26  are  non-cancerous.  We  are  interested  in  finding  out 
which  genes  are  most  predictive  of  lymphoma  through  their  expressions. 

To  rank  gene  expressions  according  to  their  discriminative  power,  we  use  the 
Fischer  discriminant  [19].  For  scalar  observations  x ,  Fischer  proposed  the  following 
measure  of  separation  between  observations  for  class  1  from  class  2 

(p-i  -  /x2)2 

+  <j\ 

where  /a-i,  /x2  are  the  means  of  the  observations  belonging  to  class  1  and  class  2 
respectively  and  eri,  <t2  the  variances.  The  larger  the  value  of  F,  the  more  separated 
are  the  probability  densities  of  the  two  classes  in  that  feature  space. 

We  order  the  4096  genes  according  to  the  Fischer  discriminant.  The  best  6 
genes  are  chosen  and  a  7-tli  order  multi-resolution  representation  is  created  by 
taking  the  vector  composed  of  the  expressions  of  the  best  k  genes.  Thus,  from 
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the  list  of  gene  expressions  sorted  according  decreasing  Fischer  index,  we  choose 
the  first  k  genes  and  concatenate  them  into  a  k  resolution  representation.  As  k 
increases,  we  append  genes  one  after  another  from  the  list.  Fig.  6.7(a)  shows  the 
points  in  the  two  classes  in  the  space  of  the  best  two  features.  A  classification  tree 
is  created  by  using  LVQ  to  split  each  node  in  a  greedy  fashion.  Since  we  have  all 
the  data  on  hand,  we  need  not  use  the  online  algorithm  as  presented  in  Chapter 
5.  Fig.  6.7(b)  shows  an  example  of  a  tree  created  by  using  the  greedy  algorithm. 
A  10-fold  cross  validation  was  performed  to  find  the  average  error.  We  summarize 


(a)  Distribution  of  the  two  classes  accord¬ 
ing  to  best  two  genes 


(b)  Example  of  tree  classifier  from  greedy 
growing  algorithm.  Solid  lines  are  cancer¬ 
ous,  dotted  non-cancerous 


the  result  in  Table  6.1.  LVQ  is  a  flat  (non- hierarchical)  LVQ  classifier  using  all  the 
expressions  from  all  the  genes  and  SVM  is  a  Support  Vector  Machine  classifier. 
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Table  6.1:  Average  prediction  error  on  lymphoma  data 


Type  of  classifier 

Avg.  error  on  test  set 

MRTSVQ 

0.085 

LVQ* 

0.021 

SVM* 

0.0104 

*  Bhamidipati  [4] 

6.5  Application  to  wear  prediction 


Trying  to  predict  the  wear  of  a  tool  from  the  sound  (equivalently,  the  vibrations) 
it  makes  is  useful  in  real-time  monitoring  of  machinery  to  detect  faults  as  and 
when  they  occur,  rather  than  wait  until  the  next  maintenance  period.  This  way, 
unnecessary  maintenance,  as  well  as  long  runs  in  a  faulty  condition,  can  be  avoided. 
In  the  case  of  a  cutting  tool,  trying  to  cut  with  a  blunt  tool  can  lead  to  the 
breakage  of  the  tool  and  degradation  of  the  job,  while  pulling  the  tool  off  for 
frequent  assessments  are  expensive  in  terms  of  the  machinist’s  time. 

We  use  two  auditory  filters,  developed  by  Shamma  et.ah,  for  preprocessing  [47], 
[48].  The  first  one  is  a  model  of  the  filter  banks  and  nonlinear  operations  that  take 
place  in  the  inner  ear  [48].  The  second  filter  mimics  the  analysis  of  the  filtered 
signal  that  take  place  in  the  primary  auditory  cortex  and  has  been  described  in 
detail  in  Chapter  3. 

6.5.1  Inner  Ear 

This  filter  (Fig  6.7)  describes  the  mechanical  and  neural  processing  in  the  early 
stages  of  the  auditory  system.  In  the  Analysis  Stage,  a  bank  of  constant-Q  filters, 
approximates  the  function  of  the  eardrum  and  the  basilar  membrane  in  the  cochlea 
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eardrum  cochlea  basilar  membrane  filters  hair-cell  stages 


lateral  inhibitory  network 


Figure  6.7:  Spectral  processing  of  sound  stimuli  in  the  inner  ear 


with  the  continuous  spatial  axis  of  the  cochlea  as  the  scale  parameter.  Another 
way  to  interpret  the  output  of  the  cochlear  filters  is  as  an  affine  wavelet  transform 
of  the  stimulus.  The  Transduction  Stage  models  the  conversion  of  the  mechanical 
displacements  in  the  basilar  membrane  into  electrical  activity  along  a  dense,  to¬ 
pographically  ordered  array  of  auditory  nerve  fibers.  This  conversion  can  be  well 
modeled  by  a  three-stage  process  consisting  of 

1.  a  velocity  coupling  stage  (time  derivative), 

2.  an  instantaneous  non-linearity  describing  the  opening  and  closing  of  the  ionic 
channels  and 

3.  a  low-pass  filter  with  a  relatively  short  time  constant  to  describe  the  ionic 
leakage  through  the  hair  cell  membranes. 

The  third  stage  called  the  Reduction  Stage  effectively  computes  an  estimate 
of  the  spectrum  of  the  stimulus,  through  a  lateral  inhibitory  network  (LIN).  The 
details  can  be  found  in  [48]. 
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The  output  from  this  filter  is  analyzed  into  multiple  resolutions  by  the  filter 
that  models  the  primary  auditory  cortex. 

6.5.2  Description  of  the  data 

The  data-set  consists  of  accelerometer  readings  from  the  spindle  head  for  three 
cases. 

1.  0.5” dia.  mill  cutting  a  steel  job, 

2.  0.5” dia.  mill  cutting  a  titanium  job  and 

3.  1.0” dia.  mill  cutting  a  steel  job. 

In  each  case,  the  speed  of  the  mill  is  different,  varying  from  344  rpm  for  case  3  to 
733  rpm  for  case  2.  All  three  cases  differ  widely  in  the  character  of  sound  as  well 
as  behavior  over  short  and  long  time  scales.  Case  1  is  rather  well  behaved,  with 
increase  in  wear  leading  to  a  gradual  change  in  the  frequency  characteristics  of  the 
sound.  Indeed,  one  can  find  harmonics  around  8kHz  that  increase  in  power,  more 
or  less  monotonically,  as  the  tool  life  increases. 

Cases  2  and  3  are  much  less  easily  analyzable.  They  show  episodes  of  high- wear- 
rate  when  the  sound  character  markedly  changes,  interspersed  through  periods  of 
quiet  cutting  when  the  tool  seems  to  behave  ideally.  There  are  no  easy  pointers 
like  the  8kHz  harmonic  in  Case  1. 

The  sound  sample  from  each  pass  of  each  tool  also  includes  a  sync  file,  which 
gives  the  beginning  of  each  revolution  of  the  tool  in  terms  of  sample  numbers.  Also 
2-3  wear  measurements  are  given  for  the  lifetime  of  each  tool. 
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6.5.3  Preprocessing  and  training 

The  sound  samples  from  the  accelerometer  were  cut  up  into  frames,  each  frame 
being  the  sound  from  one  revolution  of  the  tool.  Each  such  frame  was  re-sampled  to 
4096  samples,  normalized  to  zero  mean  and  unit  variance,  and  further  subdivided 
into  four  sub-frames  each.  Each  such  sub-frame  would  correspond  to  one  geometric 
quarter  of  the  tool,  or  one  flute. 

All  sub-frames  are  then  passed  through  the  inner  ear  and  auditory  cortex  filter 
to  obtain  a  set  of  multi-resolution  vectors  describing  the  timbre  spectrum  of  the 
sound.  We  use  an  average  of  four  sub-frames  (each  belonging  to  one  revolution)  as 
our  observation  vector.  These  vectors  were  then  used  as  input  to  the  tree  growing 
algorithm. 

We  utilize  the  class  labels  in  growing  the  TSVQ,  by  building  a  tree  for  each  class, 
using  only  the  appropriately  labeled  data.  This  method,  usually  called  Parallel 
TSVQ,  gives  better  results  than  making  one  tree  for  all  the  classes  combined.  In 
the  combined  tree,  an  initial  wrong  misclassification  into  one  particular  sub-tree 
can  end  in  a  vector  being  incorrectly  classified.  This  problem  is  avoided,  to  a  great 
extent,  in  the  parallel  case.  The  Parallel  TSVQ  is  also  quicker  to  execute  when  we 
have  a  large  number  of  classes.  Testing  on  each  tree  can  be  done  in  parallel,  which 
reduces  computational  time. 

The  tree  growing  algorithm  we  used  is  similar  to  the  algorithm  used  in  [3]  that 
we  have  described  earlier.  This  is  essentially  similar  to  the  algorithm  we  have 
presented  in  this  thesis  except  that  the  stopping  criterion  is  a  constraint  on  the 
number  of  leaves  on  the  tree  rather  than  the  rate  of  the  tree. 
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6.5.4  Testing 


Testing  was  done  on  data  belonging  to  all  three  cases.  Tools  that  had  not  been 
used  in  training  were  used  in  the  testing  procedure.  The  preprocessing  was  similar 
to  what  was  done  for  training.  Each  vector  was  dropped  down  all  five  trees  and 
the  distance  to  the  centroids  of  the  leaf  nodes  it  fell  into,  was  compared.  The 
vector  is  assigned  a  wear-class  according  to  the  wear  level  of  the  tree  that  gives 
the  least  distance  from  the  centroid  to  the  vector.  This  way,  we  get  a  time  series 
of  wear-class  prediction  for  all  the  frames  for  all  the  passes.  Next  we  take  a  sliding 
window  of  500  frames  and  find  the  mean  wear  estimate  for  this  window.  This  gives 
an  estimate  of  the  changing  wear  at  different  times. 

For  the  case  of  a  0.5”  dia.  tool  cutting  4340  steel,  the  plots  of  mean  wear 
estimates  vs.  tool  life  is  shown  in  Figs  6.8(a)  and  6.8(b),  for  different  tools.  It 
is  apparent  that  our  method  has  picked  up  features  in  the  sound  that  seem  to  be 
correlated  to  the  tool-life  and  the  wear  of  the  tool.  The  periodic  variation  in  the 


Tool  SI 


(a)  Average  (solid)  and  variance  (dashed) 
of  wear  vs.  tool  life  for  SI 
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(b)  Average  (solid)  and  variance  (dashed) 
of  wear  vs.  tool  life  for  S21 


wear  estimate  is  a  result  of  the  different  passes.  The  actual  sound  made  by  the  tool 
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is  not  just  a  function  of  the  state  of  the  tool,  but  also  depends  on  the  position  on 
the  job  the  tool  is  presently  at.  The  wear  estimate  at  the  start  and  end  of  the  pass 
are  higher  than  that  in  the  middle.  This  could  correspond  to  the  observation  that 
the  wear  rate  at  the  starting  and  ending  of  the  job  is  higher  than  in  the  middle.  A 
tool  wear  model  incorporating  wear  information  from  the  sound  is  used  for  wear 
prediction  in  [44], 

Fig  6.8(c)  shows  the  results  of  testing  a  Case  2  (0.5”  dia.  tool  cutting  titanium 
job)  tool  data  on  the  tree  trained  using  Case  1  data  only.  Here  also  we  see  the 
gradual  increase  in  the  tool  wear,  though  the  way  it  increases  is  different  from  that 
in  Case  2.  Fig6.8(d)  shows  the  results  of  the  classification  on  a  tool  from  Case  3 
(1.0”  dia.  tool  cutting  4340  steel). 


Tool  Ti4  Tool  1S2 


xIO4  xIO4 


(c)  Average  (solid)  and  variance  (dashed)  (d)  Average  (solid)  and  variance  (dashed) 
of  wear  vs.  tool  life  for  Ti4  of  wear  vs.  tool  life  for  1S2 
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6.6  Fingerprint  identification 


In  this  section  we  will  investigate  the  problem  of  storing  fingerprint  images  for 
fast  retrieval  and  identification.  A  set  of  fingerprints  takes  about  10  megabytes  of 
storage  in  the  raw  form.  The  FBI  has  roughly  200  million  fingerprints  on  record 
and  storing  all  of  them  in  the  raw  form  would  require  around  200  terabytes  of  disk- 
space.  In  addition,  whenever  a  new  fingerprint  is  obtained,  it  has  to  be  compared 
with  each  print  in  the  archive  to  find  out  if  there  is  a  match. 

The  Wavelet /Scalar  Quantizer  (WSQ)  standard  [6]  was  designed  for  lossy  com¬ 
pression  of  fingerprints  for  archival  and  transmission  purposes.  It  uses  a  wavelet 
preprocessor  followed  by  scalar  quantization  and  Huffman  coding  to  reduce  the 
data  by  approximately  1:12.  When  a  new  fingerprint  is  obtained,  it  is  compared 
with  all  the  decoded  fingerprints  in  storage  and  a  match  is  found. 

Much  of  the  time  taken  in  identifying  a  new  fingerprint  is  the  processing  time 
associated  with  uncompressing  each  fingerprint  in  the  database  and  comparing  it 
to  the  new  print.  It  is  our  contention  that  this  step  is  unnecessary  and  leaving  the 
prints  in  the  multi-resolution  domain  will  speed  up  the  search  process  in  two  ways 

1.  Computation  required  for  uncompressing  each  print  is  saved. 

2.  Multi-resolution  tree  structured  arrangement  of  prints  will  decrease  search 
time  to  less  than  0(log(/Q)  where  k  is  the  number  of  prints  in  the  database. 

To  illustrate  these  advantages,  we  store  25  fingerprints  from  the  NIST  8-Bit  Gray 
Scale  Images  of  Fingerprint  Image  Groups  (FIGS)  sample  set  [18]  using  an  MRTSVQ 
We  show  that  the  number  of  computations  needed  to  identify  a  given  fingerprint 
is  reduced  considerably  as  compared  to  a  brute  force  search.  This  process  is  called 
authentication  since  we  know  that  we  already  have  the  test  fingerprint  in  our 
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database  and  all  we  want  is  to  search  for  it  in  the  fastest  possible  way. 

There  is  also  the  complimentary  problem  of  verification  where  we  have  an 
unknown  test  fingerprint  and  we  want  to  see  if  we  have  this  print  in  our  database. 
This  problem  is  harder  than  authentication  because  we  will  need  to  extract  features 
like  loops,  whorls,  arches  and  minutiae  from  the  test  print  to  find  the  fingerprint  in 
our  database  that  it  most  closely  corresponds  to.  This  problem  has  been  addressed 
in  [23]  and  [27].  Since  extraction  of  these  features  is,  technically  speaking,  not 
related  to  classifier  design,  we  will  not  deal  with  that  problem  here. 

6.6.1  Fingerprint  encoding  and  tree  growing 

We  use  a  Bi-orthogonal  wavelet  filter  with  3rd  order  reconstruction  filter  and  7th 
order  decomposition  filter  (’bior3.7’  in  MATLAB)  to  represent  each  fingerprint  in 
a  multi-resolution  form.  There  are  6  resolutions  and  at  the  lowest  resolution,  each 
image  is  represented  by  a  set  of  484  coefficients  while  at  the  highest  resolution  the 
dimension  is  64961.  The  original  size  of  each  fingerprint  is  480  x  512  pixels  with 
8-bit  gray  level.  Tree  growing  is  accomplished  according  to  the  algorithm  given 
in  Chapter  4.  The  decision  to  split  at  a  higher  level  is  taken  when  the  fractional 
decrease  in  distortion  of  the  cell  is  lesser  than  a  value  5d  that  is  fixed  before 
executing  the  algorithm.  Larger  values  of  5d  force  the  algorithm  to  seek  higher 
resolutions  early  on  in  the  tree  growing  process,  thus  increasing  the  computational 
complexity  of  the  tree  search.  Lower  values  allow  the  algorithm  to  remain  in  the 
low  resolution  space  for  a  longer  time,  resulting  in  trees  with  lower  computational 
complexity  for  search  operations. 

Fig.  6.8(e)  shows  an  example  of  a  tree  generated  by  the  greedy  algorithm. 
Darker  branches  correspond  to  higher  resolutions. 


(e)  Tree-structured  organization  of  finger¬ 
print  data:  Darker  branches  are  at  higher 
resolution 


Value  of  8  d 


(f)  Computational  complexity  vs.  6d 


Defining  the  computational  complexity  of  the  tree  as  the  average  number  of 
multiplications  for  distance  computation  Fig.  6.8(f)  shows  the  experimental  com¬ 
putational  complexity  for  each  value  of  5d.  The  complexity  increases  sharply  with 
5d  but  is  much  smaller  than  the  complexity  of  the  brute  force  search. 
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Chapter  7 


Conclusions  and  further  research 

7.1  Conclusions 

In  the  previous  chapters  we  have  presented  algorithms  for  achieving  compression 
and  classification  of  signals  represented  in  multiple  resolutions.  We  have  presented 
a  greedy  algorithm  for  quantization  of  multi-resolution  signals  and  showed  that 
under  some  conditions  the  expected  distortion  will  go  to  zero  asymptotically  as 
the  number  of  iterations  increase  to  infinity.  The  condition  for  this  is  that  the 
dimensionality  of  the  MRA  should  not  increase  faster  than  the  logarithm  of  the 
depth.  We  have  also  shown  that  under  a  rate  constraint,  the  algorithm  will  stop 
in  finite  time. 

We  have  derived  an  online  algorithm  for  constructing  quantizers  and  shown 
that  it  minimizes  a  cost  that  is  exactly  the  distortion  error. 

For  classification  we  have  introduced  an  extension  of  LVQ  for  creating  a  multi¬ 
scale  tree.  This  algorithm  adapts  the  centroids  of  the  tree  as  a  whole  to  adapt  to 
a  sequence  of  training  samples  of  observation  vectors  and  class  labels.  We  have 
shown  that  such  an  algorithm  will  converge  to  an  equilibrium  solution  that  is  the 


equilibrium  for  an  associated  ODE. 

Tree  structured  classification  and  clustering  methods  have  wide  applicability 
in  most  problems  where  the  relationships  between  data-points  have  a  hierarchical 
structure.  This  is  more  general  than  appears  at  first  sight.  In  most  problems,  we 
implicitly  assume  that  data-points  lying  near  each  other  are  more  closely  related 
than  points  lying  far  from  each  other.  Thus,  if  we  consider  a  sequence  of  nested 
neighborhoods  of  a  point,  there  is  a  hierarchy  of  relationships  between  it  and  a 
point  in  a  neighborhood.  Thus  hierarchical  relationships  are  very  common  in  real 
world  data.  [43]  explains  in  more  detail  why  it  is  reasonable  for  most  data  to  have 
a  hierarchical  structure. 


7.2  Future  research 

7.2.1  Combined  compression  and  classification 

Combining  the  compression  and  classification  criteria  is  motivated  by  the  necessary 
tradeoff  between  the  contrary  requirements  for  good  compression  compared  to  good 
classification  performance.  An  example  that  we  considered  in  the  previous  chapter 
is  fingerprint  archival  and  classification.  Storing  fingerprints  in  the  raw  form  takes 
terabytes  of  storage  so  we  would  be  interested  in  finding  ways  of  compressing  these 
images.  But  we  cannot  compress  them  so  much  that  we  lose  vital  information 
necessary  for  discriminating  between  different  prints. 

This  example  is  typical  of  applications  where  we  not  only  need  to  reduce  the 
amount  of  storage  required  but  also  do  the  compression  in  a  way  that  does  not  lose 
the  information  necessary  for  classifying  a  new  observation.  The  first  formulation 
of  this  problem  and  its  solution  by  VQ  was  by  Perlmutter  et.  al.  [36].  They  use  the 


Nearest  Neighbor  criterion  combined  with  majority  rule  and  a  generalize  centroid 
to  develop  an  iterative  algorithm  that  seeks  to  minimize  a  weighted  combination 
of  the  classification  error  and  the  distortion. 

Baras  and  Dey  [2]  present  an  LVQ-like  algorithm  for  online  minimization  of 
a  combined  classification  and  compression  cost.  They  prove  that  this  algorithm 
converges  to  the  equilibrium  points  of  an  associated  ODE. 

It  has  not  yet  been  shown  whether  the  distortion  and  classification  errors  go 
to  the  minimum  possible  as  the  number  of  centroids  increase.  Also  there  have 
been  no  results  forthcoming  on  how  changes  in  the  relative  weighting  between  the 
classification  error  and  the  compression  error  changes  the  asymptotic  behavior  of 
the  algorithm. 

7.2.2  Convergence  of  LVQ  for  local  equilibria 

In  Chapter  5  we  have  shown  how  the  multi-scale  LVQ  algorithm  converges  to  the 
global  equilibrium  of  the  associated  ODE.  A  far  more  realistic  assumption  is  that 
there  are  more  than  one  non-global  equilibria.  In  such  a  case,  our  results  do  not 
hold. 

One  way  to  solve  this  would  be  to  use  a  Ljung-type  convergence  theorem  that 
says  that  if  the  centroids  visit  the  domain  of  attraction  of  a  local,  asymptotically 
stable  equilibrium  an  infinite  number  of  times  then  the  algorithm  will  converge  to 
that  equilibrium. 
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Appendix  A 


Geometric  result 

In  Chapter  4  we  used  the  following  geometric  result, 

Lemma  14  Consider  two  concentric  balls  B 1  =  B(0,  r)  and  B2  =  B(0,ry/d)  in 
a  d-dimensional  space.  Now  take  a  hyper-plane  H  in  this  space  that  is  tangent  to 
the  inner  ball  B\  and  intersects  the  outer  ball  in  a  hyper-circle,  thus  dividing  the 
surface  of  the  outer  ball  into  two  parts,  V\  and  V2  with  V  ol(Vf)  <  Vol(V2).  Then, 
we  have 

vMyp  i  vi 

Vol(V,)  +  Vol(V2)  2V2^ 

Proof:  Fix  a  dimension  d.  The  surface  area  of  a  sphere  in  d  dimensions  is 

7T  Lf  J  2drd~l 

(d  —  2)(d  —  4) . . .  (1  or  2)  (A,1) 

The  hyper-plane  H  intersects  B-2  in  a  hyper-circle  such  that  the  surface  area  of  the 
sphere  enclosed  within  this  hyper-circle  is  the  same  as  that  enclosed  by  a  cone  with 
vertex  at  the  center  of  the  sphere  and  angle  a  =  cos^l/y^d))  (see  Fig.  A.l). 
This  surface  area  can  be  computed  as  (omitting  tedious  details) 

(d-  3)(d  -  S) .  /<i-2(cos~1(1/v/M))  (A.2) 
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Figure  A.l:  Illustration  of  B i,  B2  and  H  in  3-dimensional  space 


where  /fc(a)  =  sink(x)dx 

Taking  the  ratio  of  the  surface  enclosed  by  the  cone  A. 2  to  the  total  area  A.l 
we  get 


R(d )  =  —  cos 

7 T 


l  /  i  \  Vd=i 


- 


\fd)  vr  d 


1  + 


2  (d-  1\  2  x  4  (d-  1 


d 


3  x  5  V  d 


+ 


2  x  4  x  . . .  x  (rf  -  A)  ( d  —  1  \  2 


3  x  5  x  ...  x  (d  -  3)\  d 


(A.3) 


if  d  is  even  and 


m  =  l  ' 


2  2  yfd 


1+  2 


1  fd-  1\  lx3/d-l 


d 


+ 


2  x  4  V  d 


+ 


1  x  3  x  ...  x  (d  -  4)  f  d  —  1 


d-3  • 
2 


(A.4) 


2x4x...x(d-3)  \  d 

if  d  is  odd. 

-R(gT)  is  monotonically  decreasing  for  increasing  d,  so  R(d')  >  lim^oo  R(d)  for 
any  d.  Now  we  will  hnd  a  lower  bound  on  lim^oo  R(d) 
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It  is  enough  to  find  the  above  limit  for  d  taking  only  odd  values.  A. 4  can  be 


written  as 

R(d)  =  --  ~^=S(d) 
{  J  2  2  Vd 

where 


+ 


lx3/d-l 


2x4 


d 


+  . . .+ 


1  x  3  x  ...  x  (d  —  4)  f  d  —  1 


2  x  4  x  ...  x  (d  —  3) 


d 


d—3 

2 


We  can  write  S(d)  =  S\(d)  —  S2 (d)  where 


=  i+i  (Y 


1x3  (d-  1 


2x4 


d 


+. .  .4 


1  x  3  x  ...  x  (i  —  4)  Sd  — l 
2x4x...x(i-3)\  d 


i—3 

2 


4-. 


for  an  infinite  number  of  terms  and 


S2(d) 


lx  3  x  ...  x  (d  —  2)  /d—  1  \  2  1  x  3  x  ...  x  (d) 

2x4x...x(d-l)\  d  )  +2x4x...x(d+l) 


d-  1 


d+1 

2 


+. . . 


S'i(d)  can  be  shown  to  be  the  Taylor  expansion  of 


\fd 


while  5*2  (d)  can  be  written  as 

(d  —  1)!  fd^lX^  f  d  (d-  1\  d{d  +  2)  fd-l\2  \ 
((^i)!)^-1  \  d  )  ^  +I+1\  d  )  +  (d+l)(d  +  3)  V  d  )  +  'J 

(d  —  1)!  (d-l\^  (  d  (d- 1\  (  d  \2  fd-l\2  \ 

>  ((^).)W  {—)  y  +  ITT  {— )  +  \JTi)  {—)  +"'J 

(d  —  1)!  f  1  \ 

(br1)!)2”"-1  V  d  J  [l-idj 

(d-l)l  /d-l\^(d  + 1) 
'  ((^i)!)2^-1  \d~ )  2 

where  the  inequality  comes  from  the  fact  that  (d  +  i)/(d  +  i  4-  1)  >  d/(d  4-1)  for 
any  i,d  >  0 
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Using  Stirling’s  formula 


n! 


JjEe-™(n  +  ir' 


for  the  factorial  of  a  sufficiently  large  integer  n,  we  can  simplify  the  lower  bound 
on  S2(d)  as 


S2(d)  > 


1  e(d  +  1)  /  d 


2vr  y/d  \d+l 


d-l\~ 


d 


Then 


I {(d)  >  - —  (  Vd 


2  2  Vd, 

which  simplifies  to 


1  e(d  +  1)  /  d 


27T  y/d  \d+l 


d-  l\-2 


d 


R(d)  > 


e  y/(l  ~  1  /d)d  1  /  1 

2^2n  y/Jl^TJd)  (1  +  1  /d)d  \  d 


Using  the  fact  the  (1  —  1  / d)d  — >  e  1  and  (1  +  1  / d)d  — >  e  as  d  — >  cx)  we  get 

lim  i?(d)  >  — - - 

d-’-oo  2y/27re 


which  gives  the  desired  result. 
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